Executive Summary


Rotorcraft and other military and commercial aircraft rely increasingly on complex and highly
integrated hardware and software systems for safe and successful mission operation as they
undergo migration from federated systems to Integrated Modular Avionics (IMA) architectures. The
current software engineering practice of “build then test” is proving unaffordable; software costs
for the latest generation commercial aircraft, for example, are expected to exceed $10B [Feiler
2009a] despite the use of process standards and best practices and the incorporation of a safety
culture. In particular, embedded software responsible for system safety and reliability is
experiencing exponential growth in complexity and size [Leveson 2004a, Dvorak 2009], making it a
challenge to qualify and certify the systems [Boydston 2009].

The  U.S.  Army  Aviation  and  Missile  Research  Development  and  Engineering  Center  (AMRDEC) 
Aviation  Engineering  Directorate  (AED)  has  funded  the  Carnegie  Mellon®  Software  Engineering 
Institute  (SEI)  to  develop  a  reliability  validation  and  improvement  framework.  The  purpose  of  this 
framework  is  to  provide  a  foundation  for  addressing  the  challenges  of  qualifying  increasingly 
software-reliant,  safety-critical  systems.  It  aims  to  overcome  the  limitations  of  current  reliability 
engineering  approaches,  leverage  the  best  emerging  engineering  technologies  and  practices  to 
complement  the  process  focus  of  current  practice,  find  acceptance  in  industry,  and  lead  to  a  new 
set  of  reliability  improvement  metrics.  In  this  report,  we 

• summarize the findings of the background research for the framework in terms of key
challenges in qualifying safety-critical, software-reliant systems

•  discuss  an  engineering  framework  for  reliability  validation  and  improvement  that  integrates 
several  engineering  technologies 

•  outline  a  new  set  of  metrics  that  focus  on  cost-effective  reliability  improvement 

• describe opportunities to leverage ongoing industry and standards efforts and potential
follow-on activities specific to the U.S. Army, to accelerate adoption of the changes in
engineering and qualification practice described above

Reliability engineering has its roots in hardware reliability assessment that uses historical data
from slowly evolving system designs. Hardware reliability is a function of time, accounting for
the wear of mechanical parts. In contrast, software reliability is primarily driven by design
defects, resulting in a failure distribution curve that does not adhere to the bathtub curve common
for physical systems.¹

Often the reliability of the software is assumed to be perfect and to behave deterministically, that
is, to produce the same result given the same input [Goodenough 2010]. Therefore, the focus in
software development has been on testing to discover and remove bugs using various test coverage
metrics to determine test sufficiency. However, time-sensitive software component interactions
may encounter race conditions, unexpected latency jitter, and unanticipated resource contention,
all of which can occur randomly. In attempts to predict sufficiency, engineers have used a
failure-probability density function based on code metrics such as source lines of code and
cyclomatic (or conditional) complexity. However, this function is not a good measure of
system-level interaction complexity and nonfunctional properties such as performance or safety.

® Carnegie Mellon is registered in the U.S. Patent and Trademark Office by Carnegie Mellon University.

1 The bathtub curve consists of three parts: a decreasing failure rate (of early failures), a constant failure rate (of
random failures), and an increasing failure rate (of wear-out failures) over time. For more information, go to
http://en.wikipedia.org/wiki/Bathtub_curve.

Insertion of corrections into operational systems to address software-related problems requires
recertification. Frequently, operational workarounds must be accepted in lieu of correction due to
high recertification cost. As a result, operators spend up to 75% of their time performing
workarounds. Clearly, a need exists to improve system recertification.

As with hardware, a reliability improvement program for software-reliant systems is needed that
includes modeling, analysis, and simulation. This type of improvement program can identify
design defects before a system is built and can support designing for robustness to counter
unplanned usage and hazard conditions [Goodenough 2010].

Studies by the National Research Council [Jackson 2007], NASA [Dvorak 2009], the European
Space Agency (ESA) [Conquet 2008], the Aerospace Vehicle Systems Institute (AVSI) [Feiler
2009a], and AED [Boydston 2009] have identified four key technologies in addressing these
challenges:

1. specification of system and software requirements in terms of both a mission-critical system
perspective (function, behavior, performance) and safety-critical system perspective
(reliability, safety, security) in the context of a system architecture to allow for completeness and
consistency checking as well as other predictive analyses

2. architecture-centric, model-based engineering using a model representation with well-defined
semantics to characterize the system and software architectures in terms of intended
(managed) interactions between system components, including interactions among the
physical system, the computer system, and the embedded software system. When annotated with
analysis-specific information, the model becomes the primary source for incremental
validation with consistency along multiple analysis dimensions through virtual integration.

3. use of static analysis in the form of formal methods to complement testing and simulation as
evidence of meeting mission and safety-criticality requirements.² Analysis results can
validate completeness and consistency of system requirements, architectural designs, detailed
designs, and implementation and ensure that requirements and design constraints are met
early and throughout the life cycle.

4. use of system and software assurance throughout the development life cycle to provide
justified confidence in claims supported by evidence that mission and safety-criticality
requirements have been met by the system design and implementation. Assurance cases
systematically manage such evidence (e.g., reviews, static analysis, and testing) and take into
consideration the context and assumptions.

Research and industry initiatives are integrating and maturing these technologies into improved
software-reliant system engineering practice. The SAE International³ Architecture Analysis and
Design Language (AADL) standard has drawn on research funded by the Defense Advanced
Research Projects Agency (DARPA) in architecture description languages (ADLs) [SAE
2004-2012]. The Automated proof-based System and Software Engineering for Real-Time applications
(ASSERT) initiative was led by the European Space Agency (ESA) and focused on representing
two families of satellite architectures in AADL, validating them, and generating implementations
from the validated architectures [Conquet 2008]. The European Support for Predictable
Integration of mission Critical Embedded Systems (SPICES) initiative integrated AADL with formalized
requirement specification, the Common Object Request Broker Architecture (CORBA)
Component Model (CCM), and SystemC. The result was an engineering framework for formal analysis
and generation of implementations [SPICES 2006]. The Correctness, Modeling, and Performance
of Aerospace SystemS (COMPASS) project focused on a system and software co-engineering
approach through a coherent set of specification and analysis techniques to evaluate correctness,
safety, dependability, and performability in aerospace systems [COMPASS 2011]. The System
Architecture Virtual Integration (SAVI) initiative, led by the international aircraft industry
consortium called the Aerospace Vehicle Systems Institute (AVSI), is maturing and putting into practice an
architecture-centric, model-based engineering approach. Using a single-truth model
repository/bus based on AADL, this approach uncovers problems in the system and embedded software
system early in the life cycle to address exponential development and qualification cost growth
[Feiler 2009a]. In particular, the SAVI initiative provides an opportunity for leveraged cooperation
[Redman 2010].

2 In this report, we group requirements into mission requirements (operation under nominal conditions) and
safety-criticality requirements (operation under hazardous conditions) rather than the more traditional grouping of
functional and nonfunctional requirements. See Section 3.1 for more detail.

Applied throughout the life cycle, reliability validation and improvement leads to an end-to-end
Virtual Upgrade Validation (VUV) approach [DeNiz 2012]. This approach builds the argument
and evidence for sufficient confidence in the system throughout the life cycle, concurrent with
development. The framework keeps engineering efforts focused on high-risk areas of the system
architecture and does so in a cost-saving manner through early discovery of system-level
problems and resulting rework avoidance [Feiler 2010]. In support of qualification, the assurance
evidence is collected throughout the development life cycle in the form of formal analysis of the
architecture and design combined with testing the implementation.

The architecture-centric framework provides a basis for a reliability validation and improvement
program of software-reliant systems [Goodenough 2010]. Building software-reliant systems
through an architecture-centric, model-based analysis of requirements and designs allows the
discovery of system-level errors earlier in the life cycle than system integration time, when the
majority of such errors are currently detected.

The framework also provides the basis for a set of metrics that can drive cost-effective reliability
validation and improvement. These metrics address shortcomings in statistical fault density and
reliability growth metrics when applied to software. They are architecture-centric metrics that
focus on a major source of system-level faults: namely requirements, system hazards, and
architectural system interactions. They are complemented by a qualification-evidence metric that
(1) is based on assurance case structures, (2) leverages the DO-178B model of qualification
criteria of different stringency for different criticality levels, and (3) takes into account the
effectiveness of various evidence-producing validation methods [FAA 2009a].


3 SAE International was formerly known as the Society of Automotive Engineers.
The effects of acting on this early discovery are reduced error leakage rates to later development
phases (e.g., residual defect prediction through the Constructive QUALity MOdel (COQUALMO)
[Madachy 2010]) and major system cost savings through rework and retest avoidance (e.g., the Feiler
return-on-investment study⁴). We can leverage these cost models to guide the cost-effective
application of appropriate validation methods.


4 Peter Feiler, Jorgen Hansson, Steven Helton. ROI Analysis of the System Architecture Virtual Integration
Initiative. Software Engineering Institute, Carnegie Mellon University. To be published.
Abstract


Software-reliant systems such as rotorcraft and other aircraft have experienced exponential
growth in software size and complexity. The current software engineering practice of “build then
test” has made them unaffordable to build and qualify. This report discusses the challenges of
qualifying such systems, presenting the findings of several government and industry studies. It
identifies several root cause areas and proposes a framework for reliability validation and
improvement that integrates several recommended technology solutions: validation of formalized
requirements; an architecture-centric, model-based engineering approach that uncovers
system-level problems early through analysis; use of static analysis for validating system behavior and
other system properties; and managed confidence in qualification through system assurance. This
framework also provides the basis for a set of metrics for cost-effective reliability improvement
that overcome the challenges of existing software complexity, reliability, and cost metrics.
1 Introduction


Rotorcraft and other military and commercial aircraft rely increasingly on complex and highly
integrated hardware and software systems for safe and successful mission operation.
Traditionally, avionics systems consisted of a federated set of dedicated analog hardware boxes, each
providing different functionality, and exchange of physical signals. Over time avionics systems evolved
to using digital implementation of the functions through periodic processing by embedded
software and exchange of digital signals through a predictable periodic communication medium such
as MIL-STD-1553B. The next step included (1) the migration to an Integrated Modular Avionics
(IMA) architecture, in which the embedded software is sometimes modularized into partitions
with interactions limited to port-based communication, and (2) the deployment of the software on
a common distributed computer platform. In this evolution, the role of embedded software has
grown from providing the functionality of individual system components to integrating,
coordinating, and managing system-level capabilities to meet mission and safety-criticality
requirements.⁵

1.1 Reliability Assessment

In keeping with this growing complexity, the qualification and reliability assessment of these
systems have become increasingly challenging to accomplish within budget and schedule [Boydston 2009].
Current practice relies on process standards, best practices, and a safety culture.⁶ The phases of a
traditional software development process are typically shown as a software development V chart (see
Figure 1). The downward portion of the V puts an emphasis on development complemented by
design and code reviews; the upward portion focuses on testing complemented by managing build
and deployment configurations. The process has evolved from a waterfall model of strict
sequencing of the phases to spiral development, in which developers iterate through the phases in order to
refine design.

Figure 1: Traditional Phases of Software Development

5 See Section 3.1 for a definition.

6 Some of the standards and practices are the Capability Maturity Model Integration® (CMMI®), MIL-STD-882,
SAE ARP4754, SAE ARP4761, DO-178B, DO-254, UK 00-56, IEC/ISO 15026, IEC 61508, and ARINC 653.
(®Capability Maturity Model Integration and CMMI are registered in the U.S. Patent and Trademark Office by
Carnegie Mellon University.)

The “build then test” approach illustrated in the V chart is becoming unaffordable for aircraft and
rotorcraft development. For example, software costs for the latest generation commercial aircraft
are reaching $10 billion [Feiler 2009a], with software making up two-thirds of the total system
cost. Furthermore, the separation of system and software engineering in the traditional process has
led to shortcomings in the delivery of system nonfunctional requirements [Boehm 2006].
Embedded software in aircraft is increasingly responsible for the safety and reliability of aircraft system
operation [Leveson 2004a, GAO 2008]. This software is also experiencing exponential growth in
size and complexity [Feiler 2009a, Dvorak 2009], making it a challenge to qualify and certify.

Reliability engineering, as practiced, has its roots in the use of statistical techniques to assess the
hardware reliability of a slowly evolving system design and an operational system affected by
wear and aging over time. Software reliability differs from hardware reliability in that it is
primarily driven by design defects. Software evolves quite rapidly and corrections are effectively design
changes. As a result, its failure distribution curve does not adhere to the bathtub curve⁷ common
for physical systems.
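The three-part shape of the bathtub curve for physical systems can be illustrated with a short numerical sketch. The Weibull hazard terms and every parameter value below are illustrative assumptions, not values taken from this report; they are chosen only so that the early-failure, random-failure, and wear-out regions are visible.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Sum of three hazard terms (all parameters are hypothetical).

    shape < 1  -> decreasing rate (early failures)
    shape == 1 -> constant rate (random failures)
    shape > 1  -> increasing rate (wear-out failures)
    """
    early = weibull_hazard(t, shape=0.5, scale=1_000.0)
    random_failures = weibull_hazard(t, shape=1.0, scale=2_000.0)
    wear_out = weibull_hazard(t, shape=5.0, scale=10_000.0)
    return early + random_failures + wear_out


# Sample the hazard early, mid-life, and late (hours are an assumed unit).
early_rate = bathtub_hazard(1.0)
mid_rate = bathtub_hazard(3_000.0)
late_rate = bathtub_hazard(15_000.0)
print(early_rate, mid_rate, late_rate)
```

The sampled rate is high at the start, lowest in mid-life, and rises again late in life. Software has no wear-out term of this kind, which is one way to read the statement that its failure distribution does not follow this curve.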

Often the reliability of the software is assumed to be perfect and to behave deterministically (i.e.,
to produce the same result given the same input) [Goodenough 2010]. Therefore, the focus in
software development has been on testing to discover and remove bugs using various test coverage
metrics to determine test sufficiency. Failure-probability density functions based on code
metrics, such as source lines of code (SLOC) and cyclomatic (conditional) complexity, have been used
as predictors with limited success [Kaner 2004]. Neither metric is a good measure of system-level
interaction complexity and nonfunctional properties such as performance, reliability, or safety. This is
because time-sensitive software component interactions may encounter race conditions,
unexpected latency jitter, and unanticipated resource contention, which occur
non-deterministically. We need not only better reliability metrics but also a change in the way we
engineer and qualify software-reliant systems. Steps in that direction include the use of the
Architecture Tradeoff Analysis Method® (ATAM®) developed at the Carnegie Mellon® Software
Engineering Institute (SEI) [Kazman 2000]. Applying the ATAM helps to identify architectural risks
in early design phases. The use of architecture models with well-defined semantics, such as the
Society of Automotive Engineers (SAE) Architecture Analysis and Design Language (AADL),
helps to drive early detection of errors through system-level analysis.
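To make the point about non-deterministic interactions concrete, the following toy sketch enumerates every ordering of two tasks that each perform an unprotected read-then-write increment on a shared counter. The task model and names are hypothetical and are not drawn from any system discussed in this report; the sketch only shows that code that is correct for each task in isolation can yield different results depending on the interleaving.

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that preserves each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of (task, operation) steps against a shared counter."""
    shared = 0
    local = {}
    for task, op in schedule:
        if op == "read":
            local[task] = shared          # task copies the shared value
        else:
            shared = local[task] + 1      # task writes back its stale copy plus one
    return shared


task_a = [("A", "read"), ("A", "write")]
task_b = [("B", "read"), ("B", "write")]

schedules = list(interleavings(task_a, task_b))
outcomes = sorted({run(s) for s in schedules})
print(len(schedules), outcomes)
```

Of the six possible schedules, some leave the counter at 2 and others at 1 (a lost update). A test run exercises only one schedule, so full statement coverage of each task says little about whether the faulty interleavings were ever observed.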

The U.S. Army Materiel Systems Analysis Activity (AMSAA) Reliability Growth Guide defines
reliability growth as “the improvement in a reliability parameter over a period of time due to
changes in the product design or the manufacturing process. It occurs by surfacing
failure modes and implementing effective corrective actions” [AMSAA 2000]. The AMSAA provides funding
for hardware reliability improvement programs that use modeling, analysis, and simulation to
identify and reduce design defects before the system is built, while software funding is focused on
finding and removing code faults through code inspection and testing. There is a clear need to

• extend reliability improvement programs to software designs

• include software failure modes in the system design

• design for robustness to address unplanned usage and hazard conditions [Goodenough 2010].

7 See footnote 1 for an explanation of the bathtub curve.

® Architecture Tradeoff Analysis Method and ATAM are registered in the U.S. Patent and Trademark Office by
Carnegie Mellon University.

In  addition,  it  is  necessary  to  define  metrics  for  quantifying  reliability  improvement  of  software- 
reliant  systems  and  developing  justified  confidence  in  the  system  behavior. 

Several studies identify technologies that are key to addressing these challenges in
software-reliant systems and the need to integrate these technologies into a system and software
co-engineering practice [Dvorak 2009, Conquet 2008, Redman 2010, Boehm 2006]. The identified
technologies are listed below:

•  model-based  engineering  driven  by  architecture  models  with  well-defined  semantics 

•  improved  specification  of  mission  and  safety-criticality  requirements  with  focus  on  system 
interaction  with  its  operational  context  and  between  subsystems 

•  application  of  static  analysis  based  on  formal  methods  to  complement  testing  in  the  end-to- 
end  verification  and  validation  (V&V)  of  systems 

•  assurance  cases  to  systematically  provide  evidence  for  justified  confidence  that  a  system 
meets  its  intent  and  requirements. 

Several  initiatives  have  been  under  way  in  industrial  settings  to  demonstrate  the  maturation  and 
integration  of  these  technologies  including 

• the development of the international SAE AADL standard [SAE 2004-2012] based on
research on Architecture Description Languages (ADLs), funded by the Defense Advanced
Research Projects Agency (DARPA) with strong industrial participation

• the application of the AADL standard for embedded systems to drive the development and
verification of two satellite system families. The European Space Agency (ESA) led this
initiative, the Automated proof-based System and Software Engineering for Real-Time
applications (ASSERT) initiative [Conquet 2008].

• the use of virtual system and software integration that reduces integration errors and that
centers on a single-source-of-truth⁸ architectural reference model based on the SAE AADL. Such
errors decline as system-level problems are discovered through model-based analysis of
mission and safety-criticality properties throughout development. This work has been
undertaken by an international aircraft industry initiative called System Architecture Virtual
Integration (SAVI) [Redman 2010, Feiler 2010].

• a system-theoretic approach to safety engineering [Leveson 2005] that builds on early work in
formalized requirement specification by Parnas [Parnas 1991]

•  the  systematic  application  of  model  checking  to  formalized  requirement  specifications  and 
system  and  software  designs,  as  well  as  to  code  [Tribble  2002,  Miller  2010,  Gurfinkel  2008] 

• the generalization of safety cases, which are part of UK Defense Std 00-56: Safety
Management Requirements for Defense Systems, into assurance cases [Goodenough 2009].

8 In a single-source-of-truth model structure, every data element is stored exactly once.

Other initiatives include

• the European Support for Predictable Integration of mission Critical Embedded Systems
(SPICES). This initiative is integrating model-based engineering, using AADL with the
Common Object Request Broker Architecture (CORBA) Component Model (CCM) and
automatic generation of SystemC code [SPICES 2006]

•  Toolkit  in  OPen-source  for  Critical  Applications  and  SystEms  Development  (TOPCASED), 
an  Eclipse-based  open  source  environment  and  embedded  system  development  platform  that 
supports  model-based  engineering  through  multiple  notations  via  the  model  bus  concept 
[Heitz  2008] 

• Correctness, Modeling, and Performance of Aerospace SystemS (COMPASS), focusing on a
system and software co-engineering approach through a coherent set of specification and
analysis techniques to evaluate correctness, safety, dependability, and performability in
aerospace systems [COMPASS 2011]

• the DARPA META program aiming to achieve dramatic improvement of the systems
engineering, integration, and testing process through architecture-centric, model-based design
abstractions that lead to quantifiable verification and optimization of system design [DARPA
2010].

To  accommodate  these  technology  advances,  process  and  practice  standards  are  being  revised  to 
foster  better  system  and  software  co-engineering,  including  the 

• recent alignment of process standards for the systems life cycle (ISO/IEC 15288) and for the
software life cycle (ISO/IEC 12207) [ISO/IEC 2008a, 2008b]

•  revision  of  recommended  practice  for  architectural  description  of  software-intensive  systems 
(IEEE  1471)  [IEEE  2000] 

• revision of software considerations in airborne systems and equipment certification (DO-178
revision C) incorporating tool qualification, model-based design and verification, use of
formal methods, and application of object-oriented technology [RTCA 1992]

• embracing of Model-Based System Engineering (MBSE) by the International Council on
Systems Engineering (INCOSE) through a set of grand challenges [INCOSE 2010]

•  development  by  the  Object  Management  Group  (OMG)  of  Unified  Modeling  Language 
(UML)  profiles  such  as  Systems  Modeling  Language  (SysML)  for  system  engineering 
[SysML.org  2010]. 

1.2 Definition of Key Terms

Before proceeding, we define some key terms as they are used in this report.

We  use  the  term  software-reliant  systems  (SRSs)  to  identify  systems  whose  mission  and  safety- 
criticality  requirements  are  met  by  an  integrated  set  of  embedded  software  and  by  their  interaction 
with  their  operators  and  other  systems  in  their  operational  environment.  Such  systems  are  also 
referred  to  as 

•  software-intensive  systems  (SISs),  indicating  the  need  for  system  and  software  co-engineering 
[Boehm  2006] 
• distributed real-time embedded (DRE) systems to indicate that they represent an integrated set
of embedded software

• cyber-physical systems (CPSs) to indicate that the embedded software interacts with,
manages, and controls a physical system [Lee 2008].

System reliability is defined as the ability of a system to perform and maintain its required
functions under nominal and anomalous conditions for a specified period of time in a given
environment. This definition is adapted from a definition by the National Computer Security Center
[NCSC 1988]. System reliability is typically expressed by a failure-probability density function
over time.
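As a minimal sketch of how a failure-probability density function relates to reliability over a period of time, the example below computes R(t) = 1 - ∫₀ᵗ f(s) ds numerically. The exponential density, the failure rate, and the mission time are assumptions chosen for illustration; the report does not prescribe a particular distribution.

```python
import math


def exponential_pdf(t, rate):
    """Failure-probability density f(t) = rate * exp(-rate * t) (assumed model)."""
    return rate * math.exp(-rate * t)


def reliability(pdf, t, steps=10_000):
    """R(t) = 1 - integral of pdf from 0 to t, using the trapezoidal rule."""
    h = t / steps
    area = 0.5 * (pdf(0.0) + pdf(t))
    for i in range(1, steps):
        area += pdf(i * h)
    return 1.0 - area * h


rate = 1e-3           # hypothetical failures per hour
mission_time = 500.0  # hypothetical mission duration in hours

r_numeric = reliability(lambda s: exponential_pdf(s, rate), mission_time)
r_closed_form = math.exp(-rate * mission_time)  # known result for the exponential case
print(r_numeric, r_closed_form)
```

For the exponential case the numerical result agrees with the closed form exp(-rate * t), which gives a quick check on the integration before substituting a different density.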

Airworthiness  qualification  is  defined  as  the  demonstration  of  an  aircraft  or  aircraft  subsystem  or 
component,  including  modifications,  to  function  safely,  meeting  performance  specifications  when 
used  and  maintained  within  prescribed  limits  [U.S.  Army  2007]. 

System  assurance  is  defined  as  justified  confidence  that  the  system  functions  as  intended  and  is 
free  of  exploitable  vulnerabilities,  either  intentionally  or  unintentionally  designed  or  inserted  as 
part  of  the  system  at  any  time  during  the  life  cycle  [NDIA  2008]. 

1.3  Purpose  and  Structure  of  This  Report 

The U.S. Army Aviation and Missile Research Development and Engineering Center (AMRDEC)
Aviation Engineering Directorate (AED) funded the SEI to develop a reliability validation and
improvement framework. The purpose is to address the challenges of qualifying increasingly
software-reliant safety-critical systems by overcoming limitations of current reliability
engineering approaches. Achieving these goals requires leveraging the best emerging engineering
technologies and practices to complement the process focus of current practice, finding acceptance in
industry, and leading an effort toward a new set of reliability improvement metrics.

In  this  report,  we 

• summarize the findings of the background research in terms of key challenges in the
qualification of safety-critical, software-reliant systems

•  discuss  an  engineering  framework  for  reliability  validation  and  improvement  that  integrates 
several  engineering  technologies 

•  outline  a  new  set  of  metrics  that  focus  on  cost-effective  reliability  improvement. 

We  close  the  report  by  describing  opportunities  to  leverage  ongoing  industry  and  standards  efforts 
and  potential  follow-on  activities  specific  to  the  U.S.  Army  that  aim  to  accelerate  adoption  of 
these  proposed  improvements  in  engineering  and  qualification  practice. 
2 Challenges of Software-Reliant Safety-Critical Systems


In this section, we take a closer look at the challenges arising from increased reliance on
embedded software and potential root causes. We do so by examining experiential data and by
identifying high-risk areas that can benefit from application of more effective engineering and
qualification technology.

2.1  Exponential  Growth  in  Size  and  Interaction  Complexity 

For the international aerospace industry, the cost of system and software development and
integration has become a major concern. Aerospace software systems have experienced exponential
growth in size and complexity and, unfortunately, also in errors, rework, and cost. Development
of safe aircraft is reaching the limit of affordability. Figure 2 shows that the size of on-board
software for commercial aircraft, measured in source lines of code (SLOC), has doubled every four
years since the mid-1990s and reached 27 million SLOC by 2010. Using the Constructive Cost Model
(COCOMO) II and assuming 70% reuse of software, the estimated cost to develop such software is as
much as $10 billion, a sum that would make up more than 65% of the total aircraft development cost.
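The doubling period quoted for Figure 2 can be checked against the slope reported in the figure annotation (0.1778). The following is a minimal sketch that assumes the fitted line relates the natural logarithm of SLOC to the calendar year; the figure annotation does not state the logarithm base, and the function name is illustrative only.

```python
import math

def doubling_time_years(slope_per_year: float) -> float:
    """Doubling time of a quantity that grows as exp(slope * year + intercept)."""
    return math.log(2.0) / slope_per_year

# Slope taken from the Figure 2 annotation; treating it as the slope of
# ln(SLOC) versus calendar year is an assumption made for this sketch.
FIG2_SLOPE = 0.1778

print(f"{doubling_time_years(FIG2_SLOPE):.1f} years")
```

Under that assumption the slope corresponds to a doubling time of about 3.9 years, which agrees with the statement that SLOC doubles about every four years.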


[Figure 2 annotations: Slope: 0.1778; Intercept: -338.5. The curve implies that SLOC doubles about
every 4 years. The line is pegged at 27.5M SLOC because the SLOC sizes for 2010-2020 are not
affordable. The COCOMO II estimated costs to develop that much ...
Airbus data source: J. P. Potocki De Montalk, "Computer Software in Civil Aircraft," Sixth Annual
Conference on Software Assurance (Compass '91), Gaithersburg, MD, June 24-27, 1991.
Boeing data source: J. J. Chilenski, 2009.]

Figure 2: Estimated On-Board SLOC Growth

Software sizes for military aircraft and rotorcraft show similar growth. This growth in size is
due to reliance on software for (1) providing flight-critical capability, such as fly-by-wire;
(2) providing mission capability through up-to-date situational awareness and command and
control; and (3) managing faults and hazards to maintain safe and reliable system operation. The
resulting embedded software systems have increased interaction complexity between embedded
software subsystems and increased potential for conflicting demands on shared computer platform
resources. At the same time, digitalization and implementation as an IMA architecture have
resulted in weight reduction.

2.2 Error Leakage Rates in Software-Reliant Systems

A  number  of  studies  have  been  performed  on  where  errors  are  introduced  in  the  development  life 
cycle,  when  they  are  discovered,  and  the  cost  of  the  resulting  rework.  We  limit  ourselves  here  to 
work by the National Institute of Standards and Technology (NIST), Galin, Boehm, and Dabney
[NIST  2002,  Galin  2004,  Boehm  1981,  Dabney  2003].  The  NIST  data  primarily  pertains  to  in¬ 
formation  technology  applications,  while  the  other  studies  draw  on  safety-critical  systems. 

Figure 3 shows a summary of three data points across development phases: the percentage of errors
introduced in a particular phase, the percentage of errors discovered in a particular phase, and a
rework cost factor normalized with respect to the cost of repair in the requirements phase. The
rework cost figures include the cost of retest.


[Figure 3 residue: source note B. W. Boehm, Software Engineering Economics, Prentice Hall (1981);
phase label: Code]

Figure 3: Error Leakage Rates Across Development Phases

The percentages of errors introduced and discovered were quite consistent across the three
studies. The figure shows that 70% of all errors are introduced during requirements engineering,
including system design (35%) and software architectural design (35%). In comparison, only 20%
of the errors are discovered by the end of code development and unit test, while 80% of the errors
are discovered at or after integration testing. The figure also shows that 20% of the errors are
introduced during code development and unit testing and 16% of the errors are discovered in that
phase.
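The percentages above imply how many errors escape the early phases. The following sketch only restates that arithmetic with the figures quoted in the text; the function and constant names are illustrative and are not part of the cited studies.

```python
def leaked_past_phase(introduced_so_far: float, discovered_so_far: float) -> float:
    """Percentage of all errors already introduced but not yet discovered."""
    if discovered_so_far > introduced_so_far:
        raise ValueError("cannot discover more errors than have been introduced")
    return introduced_so_far - discovered_so_far

# Percentages of all errors, as quoted in the text for Figure 3.
INTRODUCED_REQUIREMENTS_AND_DESIGN = 70
INTRODUCED_CODE_AND_UNIT_TEST = 20
DISCOVERED_BY_END_OF_UNIT_TEST = 20

introduced = INTRODUCED_REQUIREMENTS_AND_DESIGN + INTRODUCED_CODE_AND_UNIT_TEST
print(leaked_past_phase(introduced, DISCOVERED_BY_END_OF_UNIT_TEST))  # 70
```

In other words, 90% of all errors exist by the end of unit test, but 70 percentage points of them are still undiscovered when integration testing begins.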

Overall,  the  data  shows  that  we  need  to  do  a  better  job  of  getting  the  requirements  specified  and  of 
managing  the  interaction  complexity  between  system  components  not  only  in  terms  of  system 
functionality  but  also  in  terms  of  nonfunctional  system  properties.  Note  that  the  studies  provide 
more  detailed  data  in  terms  of  leakage  rates  between  phases  than  we  present  here. 

Figure 3 shows rework cost factors based on the Galin and Boehm studies [Galin 2004, Boehm
1981]. Table 1 provides more details regarding error rework cost factors from all four of the
studies. The numbers in Table 1 support the estimated nominal cost for fault removal shown in
Figure 3. Given those cost factor numbers, the rework cost for requirements errors alone makes up
78% of the total rework cost. In other words, there is high leverage in cost reduction through a
focus on early discovery of requirements- and system-design-related errors. In the next section,
we take a closer look at requirements-related error data to gain insight into how to improve the
situation.

Table 1: Error Rework Cost Factors Relative to Phase of Origin

[Table 1 residue: heading "Relative Defect Removal Cost of Each Phase of Origin"; phase labels
Requirements, Design, Coding, Unit test, Integration; three column groups, each citing
[NIST 2002], [Galin 2004, Boehm 1981], and [Dabney 2003]. No cell values are present.]

2.3 Requirements Errors

Hayes  used  a  requirement  fault  taxonomy  from  a  U.S.  Nuclear  Regulatory  Commission  guideline 
(NUREG/CR-6316)  to  examine  some  National  Aeronautics  and  Space  Administration  (NASA) 
data  on  system  errors  [Hayes  2003,  Groundwater  1995].  The  data  shows  that  the  top  six  require¬ 
ment-related  error  categories  are 

1. omitted/missing requirements (33%)

2.  incorrect  requirements  (24%) 

3.  incomplete  requirements  (21%) 

4.  ambiguous  requirements  (6.3%) 

5. overspecified requirements (6.1%)

6.  inconsistent  requirements  (4.7%). 

A requirements engineering study for the Federal Aviation Administration (FAA) published in
2009 [FAA 2009a] included an industry survey of requirements engineering practices that comply
with Radio Technical Commission for Aeronautics (RTCA) standard DO-178B [RTCA 1992].
The survey includes data on the notations and tools used in capturing requirements.

Figure 4 shows the notations used by different organizations. English text and structured Shall ...
statements together with tables and diagrams are the dominant notations. Note that tables include
representation of state information, such as use of truth tables. The next set of notations shows the
use of executable models such as MATLAB/Simulink and the use of data flow diagrams (i.e.,
notations that can be analyzed by tools).

An asterisk in the table means that the study cited in that column did not have data on this
category (i.e., did not distinguish it).

Figure  4  also  shows  reported  tool  usage  for  requirements  capture.  The  tools  are  dominated  by 
word  processing  tools  such  as  Microsoft  Word  and  by  Dynamic  Object  Oriented  Requirements 
System  (DOORS),  which  provides  good  support  for  requirements  traceability.  Databases  are  used 
as  an  alternate  way  of  maintaining  requirements  traceability.  Spreadsheets  are  effective  in  main¬ 
taining  tabular  representations,  especially  if  they  include  some  computation.  Simulink  as  a  tool 
supports representation of executable Simulink models.

Figure 4: Notations and Tools Used in DO-178B-Compliant Requirements Capture
[RTCA 1992]

A  study  by  Groundwater  and  colleagues  investigates  the  effectiveness  of  different  techniques  for 
finding  errors  in  a  single-mode  transition  diagram  and  in  two  interacting-mode  transition  dia¬ 
grams  [Groundwater  1995].  The  results  are  shown  in  Figure  5.  Since  expected  mode  behavior  in 
the  form  of  mode  transition  is  commonly  part  of  requirement  specification,  it  is  clear  that  errors  in 
such  requirement  specifications  can  easily  propagate  into  the  design  and  implementation  phases. 
Therefore, it is desirable to validate such specifications during requirements capture.

Figure 5: Effectiveness of Different Error Detection Techniques


2.4 Mismatched Assumptions in System Interaction

System  engineers  are  concerned  about  the  interface  between  the  system  and  its  operator,  as  well 
as  the  interaction  of  the  system  with  its  operational  environment.  In  many  cases,  the  system  con¬ 
sists  of  a  system  under  control  and  a  control  system  that  monitors,  controls,  and  manages  its  oper¬ 
ation.  In  the  process  of  specifying  the  requirements  for  the  system,  engineers  make  assumptions 
about  how  operators  interact  with  the  system  and  about  the  operational  environment.  Similarly, 
they  make  assumptions  about  the  physical  system  under  control  when  they  specify  the  require¬ 
ments  for  the  control  system.  These  assumptions  may  not  always  be  valid  and  may  be  violated 
over  time  due  to  changes  in  the  operational  context  or  in  the  system  itself.  The  following  exam¬ 
ples  illustrate  the  point: 

• Physical systems are commonly designed under assumptions about certain conditions of the
operational environment, such as the temperature range in which the system should be operated.

•  Operators  are  expected  to  have  situational  awareness  of  the  operational  environment  with 
sometimes  limited  or  misleading  information  or  guidance.  The  result  can  be  an  accident,  as 
was  the  case  in  the  ComAir  crash  when  one  of  the  taxiways  was  under  construction  [Nelson 
2008]. 

•  An  example  of  mismatched  assumptions  made  about  the  interaction  between  the  operator  and 
the  system  is  an  incident  in  which  a  subway  train  left  the  platform  without  the  operator  pre¬ 
sent  at  the  operator  console.  One  of  the  doors  on  the  first  car  had  difficulty  closing.  The  oper¬ 
ator  stepped  out  of  the  operator  cabin  to  close  the  door  and  the  semi-automated  system — 
sensing  that  all  doors  were  closed — departed  with  the  operator  standing  on  the  platform. 

Similarly,  assumptions  exist  when  a  control  system  is  designed  for  a  system  under  control  (see 
Figure  6).  For  example,  engineers  make  assumptions  about  the  lift  generated  by  aircraft  wings  in 
determining the maximum load and in identifying the operational envelope. The interaction
complexity among several system parameters can lead to violation of assumptions about system
parameters and result in incidents or accidents. This circumstance caused Air France flight 447 to
crash en route from Brazil to France [Spiegel 2010]. Flight 447 was loaded to within 240 kg of
maximum capacity, and its estimated fuel consumption was based on a majority of the flight
occurring at 39,000 feet. At that altitude, the range of operational speeds that maintain the
required lift is quite narrow, which increases the risk of operating outside a safe flight envelope
in non-nominal situations such as severe turbulence due to storms. This example illustrates the
challenge of understanding all system hazards and specifying appropriate safety and reliability
requirements.

Figure 6: Interaction Complexity and Mismatched Assumptions with Embedded Software

The upper half of Figure 6 illustrates some of this interaction complexity and the potential for
mismatched assumptions. As these control and under-control systems have become software reliant,
new areas of interaction complexity and potential for mismatched assumptions have been
introduced, as shown in the lower half of Figure 6.

As  system  functionality  is  implemented  in  software,  variables  in  the  environment  are  translated 
into  input  and  output  variables  on  which  the  embedded  software  operates.  System  engineers  may 
assume  the  data  in  the  variables  to  be  expressed  in  a  particular  measurement  unit,  which  may  not 
have  been  communicated  to  the  software  engineer  when  system  requirements  were  translated  into 
software requirements. Similarly, the expected range of values and the degree of precision with
which they are represented are affected by the base type chosen for the variable. For example, one
of the contributing factors to the Ariane 5 accident was the use of a 16-bit integer variable, which
could handle the range of values for Ariane 4 but resulted in negative values due to wraparound
[Nuseibeh 1997].
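The effect of choosing too narrow a base type can be illustrated in a few lines. This is a generic sketch of 16-bit signed wraparound and of a range check that would reject the value instead; it is not a reconstruction of any flight software, and the sample values and function names are hypothetical.

```python
import ctypes

INT16_MIN, INT16_MAX = -32768, 32767

def wrap_to_int16(value: int) -> int:
    """Keep only the low 16 bits and reinterpret them as a signed value."""
    return ctypes.c_int16(value).value

def checked_to_int16(value: int) -> int:
    """Refuse values that the 16-bit signed base type cannot represent."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return value

print(wrap_to_int16(32767))  # 32767
print(wrap_to_int16(32768))  # -32768
print(wrap_to_int16(40000))  # -25536
```

A value that was always in range on one platform silently becomes negative once the physical range grows, unless the range assumption is stated and checked as in `checked_to_int16`.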

Application software is integrated into a runtime architecture that supports multiple operational
modes, with different modes involving different subsets of active tasks and communication
channels. Race conditions and nondeterministic sequences of actions can arise in task interaction
when the application software makes assumptions about the runtime environment. For example, the
application software may assume that two tasks do not require explicit synchronization because
execution of both tasks on the same processor under a non-preemptive scheduling protocol ensures
mutual exclusion. These assumptions may be violated in migration to a multi-core processor or to
other runtime systems and computer hardware. Analysis of formalized system models allows us to
(1) discover these problems early in the technology refresh cycle and (2) increase our confidence
that we have addressed intricate time-sensitive logic errors that are difficult to test for.

The computer platform is typically a distributed, networked set of processors with redundancy to
provide reliable system operation. This means that replicated instances of the embedded software
execute on different processor instances and communicate over different network instances. A
change in the deployment configuration of the embedded software may allocate replicated software
components to the same physical processor and thereby eliminate physical redundancy. Similarly,
migration of embedded software to a partitioned runtime architecture such as ARINC653 can result
in reduced reliability if the mapping of embedded software to partitions and the binding of
partitions to physical hardware are not performed consistently with the redundancy requirements
for the system. Finally, virtualization can lead to unplanned resource contention and performance
that is slower than expected [Nam 2009].
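A deployment check of the kind described above can be automated once the two mappings (software to partitions, partitions to processors) are written down. The sketch below is a minimal illustration; the data layout, the component and processor names, and the function name are all assumptions made for the example.

```python
def shared_processor_violations(replicas, partition_to_processor):
    """Return components whose replicas are bound to the same physical processor.

    replicas: component name -> list of partitions hosting its replicas
    partition_to_processor: partition name -> physical processor name
    """
    violations = {}
    for component, partitions in replicas.items():
        processors = [partition_to_processor[p] for p in partitions]
        duplicates = sorted({cpu for cpu in processors if processors.count(cpu) > 1})
        if duplicates:
            violations[component] = duplicates
    return violations

# Hypothetical deployment: both "nav" replicas end up on cpu_c.
REPLICAS = {"flight_control": ["P1", "P2"], "nav": ["P3", "P4"]}
BINDING = {"P1": "cpu_a", "P2": "cpu_b", "P3": "cpu_c", "P4": "cpu_c"}

print(shared_processor_violations(REPLICAS, BINDING))  # {'nav': ['cpu_c']}
```

The logical design still shows two replicas of `nav`, but the binding has removed the physical redundancy; running such a check on every configuration change makes the redundancy assumption explicit.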

Embedded  applications  may  process  time-sensitive  data  and  process  data  in  a  time-sensitive  man¬ 
ner.  For  example,  a  control  system  makes  assumptions  about  the  latency  of  a  data  stream  from  a 
sensor  to  an  actuator.  Different  control  algorithms  have  different  thresholds  of  sensitivity  to  sam¬ 
pling  jitter,  which,  if  exceeded,  can  result  in  unstable  control  behavior  [Feiler  2008].  Differences 
in the task execution and communication timing of different runtime architectures and their
particular hardware deployments can affect latency, as well as sampling and latency jitter. For
example, late delivery of data such as helicopter main rotor speed (Nr) in an autorotation can lead
to a catastrophic result.
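Latency and latency jitter can be measured from paired time stamps and compared against the threshold a control algorithm tolerates. The sketch below assumes jitter is measured as the spread between the largest and smallest observed latency; the time stamps and the 5 ms limit are hypothetical values chosen only for illustration.

```python
def latency_stats(sensor_times_ms, actuator_times_ms):
    """Return (min latency, max latency, jitter) for paired time stamps."""
    latencies = [out - inp for inp, out in zip(sensor_times_ms, actuator_times_ms)]
    return min(latencies), max(latencies), max(latencies) - min(latencies)

SENSOR_MS = [0, 20, 40, 60]       # when each sample was read (hypothetical)
ACTUATOR_MS = [12, 35, 51, 78]    # when each result reached the actuator
JITTER_LIMIT_MS = 5               # hypothetical tolerance of the controller

lowest, highest, jitter = latency_stats(SENSOR_MS, ACTUATOR_MS)
print(lowest, highest, jitter)       # 11 18 7
print(jitter <= JITTER_LIMIT_MS)     # False
```

Here the average latency may look acceptable while the jitter exceeds the assumed tolerance, which is the kind of violated assumption the paragraph describes.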

Similarly, it is common practice to implement event processing by periodically sampling a data
variable whose change in value signals that an event has occurred and that reverts to its original
value after a given duration. The application logic assumes that all events are communicated to
the recipient. However, when the variation in sampling time exceeds a certain threshold, the
recipient fails to observe an event. Such an unanticipated loss of events can result in
inconsistent system states and deadlock in system interaction protocols.
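The loss of sampled events can be reproduced with a small deterministic simulation. The sketch below treats each event as a time interval during which the variable holds its "event" value and counts the events that at least one sample lands in; the pulse times and sample times are hypothetical.

```python
def observed_events(pulses, sample_times):
    """Count pulses (start, end) that contain at least one sample time."""
    return sum(
        1 for start, end in pulses
        if any(start <= t < end for t in sample_times)
    )

# Three events, each visible for 10 ms (hypothetical times in ms).
PULSES = [(5, 15), (45, 55), (85, 95)]
REGULAR = list(range(0, 101, 10))                       # strict 10 ms period
JITTERED = [0, 10, 20, 30, 44, 56, 60, 70, 80, 90, 100]  # one gap of 12 ms

print(observed_events(PULSES, REGULAR))   # 3
print(observed_events(PULSES, JITTERED))  # 2
```

With a strict period every event is seen; once one sampling gap grows beyond the 10 ms that the value is held, the second event falls between two samples and is never observed by the recipient.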

In  a  study  of  embedded  software  systems  with  system-level  problems  that  escape  traditional  fault 
tolerance  mechanisms,  the  SEI  has  identified  four  root  cause  areas  that  require  attention  [Feiler 
2009b]: 

1. processing of data streams in a time-sensitive manner. Data streams are often processed
in multiple steps. Different components involved in processing a data stream make
assumptions about

•  the  data  of  a  data  stream:  for  example,  the  application  data  type  (e.g.,  temperature),  its  base 
type  representation  (e.g.,  16-bit  unsigned  integer),  acceptable  range  of  values,  base  value  that 
is  represented  as  zero  (e.g.,  -50),  and  measurement  unit  (e.g.,  degree  Celsius) 

•  the  timing  of  the  data  stream:  age  of  the  data  (i.e.,  time  since  it  was  read  by  a  sensor),  data 
latency  (i.e.,  handling  time  of  new  data),  and  latency  jitter  (i.e.,  variation  in  latency) 

•  the  data  stream  characteristics:  for  example,  acceptable  transmission  rates,  acceptable  rates  of 
missing  stream  elements,  out  of  sequence  data,  dropped  data,  corrupted  data,  data  encryp¬ 
tion/decryption,  and  acceptable  limits  in  value  changes  between  elements  of  the  data  stream 

• synchronization of data in voting or self-checking pair systems, time stamping, and time
distribution throughout a system

ARINC653 is a specification for system partitioning and scheduling that is often required in
safety- and mission-critical systems.

2.  interaction  between  state-based  systems  with  replicated,  mirrored,  and  coordinated 
state  machines.  Examples  are  replicated  discrete  state  applications,  multiple  distributed  in¬ 
stances  of  redundancy  management  logic,  and  handshaking  protocols  or  coordinated  opera¬ 
tional  modes.  The  state  transition  logic  embedded  in  the  state  machine  may  make  assump¬ 
tions  about 

a.  the  interaction  of  replicated  and  mirrored  state  machines  by  working  with  the  same  in¬ 
puts  exclusively,  by  comparing  states  periodically,  or  by  observing  each  other’s  state 
behavior  in  order  to  detect  anomalous  behavior.  The  state  logic  may  not  accommodate 
failures  in  the  application  logic,  in  the  underlying  hardware,  or  in  timing  differences 
due  to  an  asynchronous  computer  platform. 

b. the communication of state versus state change (e.g., exchange of track information and
track updates). Communication of state change information assumes guaranteed and often
ordered delivery of information by the communication protocols and hardware.

c. the communication of events by sampling state variables. The particular implementation,
while maintaining a periodic task set, may not guarantee observation of every event, or
queuing of events if an arrival burst exceeds the processing rate, because of the mismatch
between the paradigms of guaranteed event processing and data sampling.

3.  performance  impact  of  resource  management.  Such  impact  is  especially  important  when 
sharing  computer  resources  in  IMA  architectures  and  can  lead  to  a 

a. mismatch of resource demand and capacity, where the demand may exceed the capacity
or the capacity of one resource may exceed the capacity of a connected resource. For example,
a high-bandwidth gigabit Ethernet network can flood low-performance processors, resulting
in denial of service and lower than expected processor speed.

b.  lack  of  guaranteed  resource  capacity  assumed  to  be  available  to  the  embedded  applica¬ 
tion  due  to  undocumented  resource  sharing  and  unmanaged  resource  usage.  For  exam¬ 
ple,  direct  memory  access  (DMA)  transfers  that  continue  independent  of  the  application 
software  initiating  them  consume  bus  and  memory  bandwidth  assumed  to  be  available 
to  other  application  software. 

c.  mismatch  in  assumptions  by  the  application  in  the  execution  and  communication  timing 
and  ordering,  and  in  the  scheduling  of  the  processor  and  network  resources  by  the  un¬ 
derlying  runtime  system 

4.  virtualization  of  resources.  The  virtualization  of  resources  such  as  processors  and  networks 
brings  flexibility  to  a  system  design  and  provides  resource  budget  enforcement.  However,  it 
can  lead  to 

a.  loss  of  reliability  due  to  differences  in  logical  redundancy  and  physical  redundancy — if 
logical  resources  are  mapped  to  physical  resources  in  conflict  with  application  and  safe¬ 
ty-criticality  assumptions 

b. limitations in resource isolation guarantees in terms of both guaranteed resource capacity
available to a virtual resource (e.g., virtual channels competing for resources) and
information leakage between applications due to resource sharing

c. time inconsistency due to the application software’s operating in virtual time. For
example, multitasking and execution of embedded applications in partitions do not guarantee
input sampling at known time intervals when such input sampling is performed as
part of the application code. Similarly, time stamping of time-sensitive data can lead to
inconsistency in a multi-clock distributed platform.

d.  mixed-criticality  systems  in  which  applications  with  periodic  and  event-driven  resource 
demands  and  different  security  levels,  safety  levels,  and  redundancy  requirements  must 
use  shared  resources  consistently  despite  conflicting  demands 
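The data assumptions listed under the first root cause area (application data type, base type, value range, base value represented as zero, and measurement unit) can be recorded explicitly and checked where components are integrated. The following is a minimal sketch; the class, its field names, and the `step` resolution field are hypothetical, and the temperature numbers only echo the examples given in the list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamItemSpec:
    """Explicit statement of the assumptions a producer and consumer must share."""
    name: str          # application data type, e.g. "temperature"
    unit: str          # measurement unit, e.g. "degC"
    bits: int          # width of the unsigned base type
    zero_value: float  # physical value that raw 0 represents
    step: float        # physical units per raw count (hypothetical field)
    min_value: float   # acceptable range, lower bound
    max_value: float   # acceptable range, upper bound

    def decode(self, raw: int) -> float:
        """Convert a raw count to a physical value, rejecting violations."""
        if not 0 <= raw < 2 ** self.bits:
            raise ValueError(f"raw value {raw} does not fit in {self.bits} bits")
        value = self.zero_value + raw * self.step
        if not self.min_value <= value <= self.max_value:
            raise ValueError(f"{self.name} {value} {self.unit} is out of range")
        return value

def require_same_unit(producer: StreamItemSpec, consumer: StreamItemSpec) -> None:
    """Fail at integration time if the two sides assume different units."""
    if producer.unit != consumer.unit:
        raise ValueError(f"unit mismatch: {producer.unit} vs {consumer.unit}")

TEMPERATURE = StreamItemSpec("temperature", "degC", 16, -50.0, 0.5, -50.0, 150.0)
print(TEMPERATURE.decode(150))  # 25.0
```

Writing the specification down once and sharing it between producer and consumer turns an implicit assumption into something a tool can check before the components are integrated.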

2.5  Software  Hazards  and  Safety-Criticality  Requirements 

Safety-critical  systems  have  reliability,  safety,  and  security  requirements.  These  requirements  are 
typically  addressed  as  part  of  system  engineering.  The  reliability  of  a  system  and  its  components 
is  driven  by  availability  requirements  and  by  the  safety  implications  of  failing  system  compo¬ 
nents.  The  FAA  has  introduced  five  levels  of  criticality  and  has  associated  reliability  figures  in 
terms  of  mean  time  between  failures  (MTBF).  Typically  the  required  reliability  numbers  are 
achieved  through  redundancy  (e.g.,  through  dual  or  triple  redundancy  in  flight  control  systems). 
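The way redundancy raises a reliability figure can be shown with a short calculation. The sketch below assumes a constant failure rate (so that a channel survives a mission of length t with probability exp(-t / MTBF)) and assumes that channels fail independently; neither assumption is stated in the text above, and the MTBF and mission length are hypothetical. The independence assumption in particular is the one that Common Cause Analysis is meant to challenge.

```python
import math

def channel_reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability that one channel survives the mission (constant failure rate)."""
    return math.exp(-mission_hours / mtbf_hours)

def redundant_reliability(channel: float, channels: int) -> float:
    """Probability that at least one of `channels` independent channels survives."""
    return 1.0 - (1.0 - channel) ** channels

# Hypothetical numbers: 10,000-hour MTBF per channel, 10-hour mission.
single = channel_reliability(mtbf_hours=10_000.0, mission_hours=10.0)
for n in (1, 2, 3):
    print(n, redundant_reliability(single, n))
```

Each added channel multiplies the failure probability by the single-channel failure probability, but only as long as the channels really do fail independently.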

Similarly, the safety of a system is assured through a series of analyses that identify hazards and
their manifestation through Functional Hazard Assessment (FHA), followed by Preliminary System
Safety Assessment (PSSA) and System Safety Assessment (SSA) for the top-level system design.
Next, Common Cause Analysis (CCA) identifies system components that violate the independence
assumption about failure occurrences on which many reliability predictions rely. These analyses
take the form of fault tree analysis (FTA) and failure mode and effects analysis (FMEA) as the
system design evolves [SAE 1996, FAA 2000].

Experience  has  shown  that  such  safety  analysis  must  take  into  account  interactions  with  operators 
and  the  operational  environment,  as  well  as  the  development  environment  as  sources  of  contrib¬ 
uting  and  systemic  hazards  [Leveson  2004a,  2005].  Controlling  these  sources  of  hazard  involves 

•  translating  the  results  of  such  safety  hazard  analysis  into  safety  requirements  on  the  system 

• validating these requirements for completeness and consistency, and ensuring that they are
satisfied at design time or managed by fault tolerance mechanisms in the system when violated.

Such translation and validation have led to the formalization of requirements, often expressed as
discrete state behavior and boundary conditions on physical system and environmental state
[Parnas 1991, Leveson 2000, Tribble 2002], and to their V&V through formal methods [Groundwater
1995]. They have also led to an understanding that system safety and reliability are emergent
system properties. System reliability can be achieved with unreliable components, and reliable
system components do not guarantee system reliability or system safety [Leveson 2009].

Understanding how to quantify software’s contribution to system reliability and safety has been a
challenge. Initially it may seem that software reliability cannot be addressed in terms of MTBF
because software errors either exist or do not exist (i.e., a software function will always produce
the same result when executed under exactly the same conditions). However, such conditions
include not only the function’s explicit inputs, but also its environment, which may reflect its
execution history and other activities that are happening concurrently with the function’s
execution. For example, software interacting with other systems or with humans may be sensitive
to which functions have been invoked and which actions have been taken prior to the function’s
invocation. Likewise, its behavior might be dependent on timing, resource contention, or other
functions that are executing concurrently with it. In such cases, only a particular execution order
and environmental state may cause an error. Even though a given execution order and
environmental state will produce the same erroneous result every time, the likelihood that the
error-inducing fault activations will occur depends on the operational use circumstances of the
system.

CMU/SEI-2012-SR-013 | 14

Errors  in  software  designs  and  implementations  present  both  reliability  and  safety  hazards  and 
must  be  treated  accordingly;  that  is,  they  must  be  eliminated,  reduced  in  likelihood,  or  mitigated. 
We  need  to  understand  the  hazards  introduced  by  possible  errors  in  software  and  their  impact  on 
system  reliability  and  safety.  This  requires  an  understanding  of  the  role  of  software  in  system  reli¬ 
ability  and  safety.  Software  is  not  just  a  source  of  failure;  it  is  also  responsible  for  managing  fault 
tolerance.  Embedded  software  may  contribute  to  the  desired  mission  capability,  to  reliability  (by 
implementing  fault  management  of  physical  system  components,  of  the  computer  platform,  and  of 
the  mission  software),  and  to  safety  (by  monitoring  for  violation  of  safety  requirements). 

Clearly,  reducing  software  errors  is  a  way  of  improving  software  reliability  (i.e.,  it  is  a  way  of 
reducing the likelihood that software will fail under certain conditions). But there are two
important classes of software errors: architectural errors (introduced in the design phase) and
implementation errors (introduced in the implementation phase). The distinction between the types
of errors pertains not only to the phase in which they are introduced; it also lies in the
characteristics of the errors. In particular, what we call “architectural” errors have to do with the
explicit  (and  implicit)  interactions  between  system  components.  (Implicit  interactions  can  occur 
through  contention  for  shared  resources,  timing  properties,  dependence  on  shared  state,  etc.)  In 
practice,  the  architectural  errors  (sometimes  called  design  errors)  are  the  ones  most  often  impli¬ 
cated  in  actual  accidents  [Leveson  2004b].  Methods  and  processes  intended  to  improve  system 
safety,  reliability,  and  security  must  focus  on  detecting  (and  eliminating)  architectural  errors.  Be¬ 
cause  architectural  errors  can  never  be  completely  eliminated  in  complex  systems,  such  methods 
must  also  ensure  that  the  failure  effects  of  errors  are  adequately  mitigated — in  addition  to  validat¬ 
ing  the  source  code  of  the  implementation. 

2.6  Operator  Errors  and  Work-Arounds 

Historical  data  [Couch  2010]  and  numerous  investigations  of  rotorcraft  accidents  identify  the  op¬ 
erator/pilot  as  the  root  cause  in  80%  of  the  cases.  This  finding  is  often  motivated  by  the  need  to 
look  for  blame  [Leveson  1995].  However,  in  order  to  improve  the  safety  record  of  such  systems, 
we  must  consider  other  contributing  factors  such  as  design  errors  or  complex  operational  proce¬ 
dures,  and  we  must  include  the  operator  in  the  system  analysis. 

When  problems  are  discovered,  a  solution  may  be  identified  but  not  installed  in  fielded  systems 
immediately.  In  the  case  of  software  problems,  the  corrections  are  design  changes,  which  may 
have unintentional side effects. Furthermore, safety-critical systems require recertification,
which, with current practice, is quite expensive and impractical for incremental corrections.
Instead, work-arounds are added to operational procedures, and operators may spend a majority of
their time performing them. In other words, we have passed to the system operator the
responsibility to compensate for system problems that could be corrected.



This  indicates  a  clear  need  for 

•  including  operator  behavior  specifications 

•  identifying  inconsistencies  and  complexities  in  the  operator’s  interaction  with  the  system, 
including  situations  where  the  operator  does  not  follow  procedures  (intentionally  or  uninten¬ 
tionally) 

•  improving  the  reliability  of  software-reliant  systems  by  reducing  error  leakage 

•  improving  the  cost  effectiveness  of  system  qualification. 

2.7  Errors  in  Fault  Management  Systems 

Safety-critical systems contain mission software and fault management software. Fault
management software may make up 50% or more of the total system, and errors in the fault management
logic make up a considerable percentage of all system errors. The system portion responsible for
reliable operation is, itself, unreliable. One reason for this is a limited understanding of software
faults and hazards, due to assumptions made about the operational environment, system
components, and system interactions. A second reason is the interaction complexity in systems,
particularly of the embedded software. A third reason is the challenge of testing the fault management
portion of a system. Fault management software is only executed when the system fails, and fault
management errors are only triggered when the system is already dealing with an erroneous
condition. Fault injection has been used to exercise fault management, but has struggled to address
concurrency and timing-related faults, as well as testing under all expected operational contexts.
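
The point that fault management code runs only when something has already gone wrong can be made concrete with a minimal sketch. Everything named here (SensorFault, read_altitude_with_fallback, the two sensor functions) is hypothetical and chosen for illustration only; it is not taken from any system discussed in this report.

```python
class SensorFault(Exception):
    """Hypothetical fault raised when a sensor reading is unavailable."""


def read_altitude(sensor):
    """Mission logic: return the sensor reading in meters."""
    return sensor()


def read_altitude_with_fallback(sensor, last_good):
    """Fault management logic: fall back to the last good value on a fault.

    Returns (value, degraded); degraded is True if the fallback was used.
    """
    try:
        return read_altitude(sensor), False
    except SensorFault:
        return last_good, True


def healthy_sensor():
    return 1200.0


def faulty_sensor():
    # Injected fault: without it, no test ever reaches the fallback path.
    raise SensorFault("injected: no data")


nominal = read_altitude_with_fallback(healthy_sensor, last_good=1180.0)
injected = read_altitude_with_fallback(faulty_sensor, last_good=1180.0)
print(nominal)   # nominal path; the fallback code never runs
print(injected)  # fallback path; reached only through the injected fault
```

A sketch like this exercises a single, sequential fault. It does not address the concurrency and timing-related faults noted above, which is where fault injection has struggled.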

Since fault management is a critical component of the system for achieving reliable and safe
operation, it is important to improve its quality. We can achieve this by formally specifying and
analyzing reliability and safety requirements to address identified hazards and assumptions,
decomposing these requirements along the system architecture, and validating the system architecture
and its implementation (including fault management) against these requirements.

2.8  Reliability  Improvement  and  Degradation  Over  Time 

Current practice in the reliability improvement of software is focused primarily on finding and
removing bugs through review and testing. Testing has focused on exercising the code with
various inputs to ensure that all code statements execute as expected and produce the expected results.
Many of the test coverage approaches reflect the assumption that software behaves
deterministically (i.e., for given inputs, the software executes the same statements and produces the same
results). Black-box testing focuses on mapping sets of input data to expected output data. White-box
testing focuses on exercising the program logic reflected in the source code statements. Since
there are many potential interactions between source code statements, programming language
abstractions have been introduced to manage this complexity through concepts such as

• data abstraction and object orientation

• strong typing

• modularity with well-defined interfaces

• restrictions such as static memory allocation

• the absence of application-level manipulation of pointers (as found in high-integrity subset
profiles of programming languages, such as the Ada Ravenscar profile [Ada WG 2001]).

(Footnote: Ada is an excellent example of a programming language for reliable software.)

Despite  these  capabilities,  the  challenge  is  to  find  white-box,  black-box  and  system  test  coverage 
approaches  that  bound  the  amount  of  necessary  testing. 

Practice standards for safety-critical systems, such as DO-178B, provide guidance on the degree
of coverage necessary for software with different levels of criticality. For example, the most
critical (Level A) software (which is defined as that which could prevent continued safe flight and
landing of an aircraft) must satisfy a level of coverage called Modified Condition/Decision
Coverage (MC/DC). In other cases, DC, branch coverage, or statement coverage is sufficient. However,
confusion exists among practitioners as to the appropriate use of these different testing
techniques. Quoting the FAA Certification Authorities Software Team (CAST) [FAA 2010]:

The issue is that at least some industry participants are not applying the “literal” definition
of decision. I.e., some industry participants are equating branch coverage and decision
coverage, leading to inconsistency in the interpretation and application of DC and MC/DC in
the industry. Tool manufacturers, in particular, tend to be inconsistent in the approach, since
many of them come from a ‘traditional’ testing background (using the IEEE definitions),
rather than an aviation background.
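
To illustrate why these coverage levels are not interchangeable, the following is a minimal sketch that compares decision coverage with a unique-cause reading of MC/DC for one hypothetical decision, a and (b or c). The function names and the test vectors are assumptions made for illustration. The check covers only the independence criterion of MC/DC and is not a substitute for the definitions in the standard.

```python
from itertools import product


def decision(a, b, c):
    """Hypothetical decision with three conditions."""
    return a and (b or c)


def achieves_decision_coverage(tests):
    """Decision coverage: the decision evaluates to both True and False."""
    return {decision(*t) for t in tests} == {True, False}


def independence_pairs(index, tests):
    """Pairs of tests that differ only in one condition and flip the outcome."""
    pairs = []
    for t1, t2 in product(tests, repeat=2):
        differs_only_here = all(
            (x != y) == (i == index) for i, (x, y) in enumerate(zip(t1, t2))
        )
        if t1 < t2 and differs_only_here and decision(*t1) != decision(*t2):
            pairs.append((t1, t2))
    return pairs


def achieves_mcdc(tests):
    """Independence check: every condition has at least one such pair."""
    return all(independence_pairs(i, tests) for i in range(3))


two_tests = [(True, True, False), (False, True, False)]
four_tests = [
    (True, True, False),
    (False, True, False),
    (True, False, False),
    (True, False, True),
]

print(achieves_decision_coverage(two_tests), achieves_mcdc(two_tests))
print(achieves_decision_coverage(four_tests), achieves_mcdc(four_tests))
```

In this sketch two tests are enough for decision coverage, but they show the independent effect of only one of the three conditions. The four-test set shows it for all three.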

A complicating factor is that embedded software executes as an interacting set of concurrent tasks
that operate on time-sensitive data and events. Such software may encounter problems that appear to
occur randomly and are difficult to test systematically, such as race conditions, unexpected
latency jitter, and unanticipated resource contention. Given the exponential growth in software
size and interaction complexity, the concept of exhaustive testing has turned into testing until the
budget or schedule has been exhausted [Boydston 2009].
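
The difficulty of testing such behavior systematically can be seen even in a very small case. The following minimal sketch, with hypothetical names throughout, enumerates every interleaving of two tasks that each read a shared counter and then write back the value plus one. It is a deterministic model of a race condition, not a test of real concurrent code.

```python
from itertools import combinations


def run_schedule(schedule):
    """Run two tasks that each increment a shared counter in two steps.

    schedule is a sequence of task ids (0 or 1). Each task appears twice:
    its first appearance is the read step, its second is the write step.
    """
    shared = 0
    local = {}
    for task in schedule:
        if task not in local:
            local[task] = shared        # read step
        else:
            shared = local[task] + 1    # write step
    return shared


def all_schedules():
    """Enumerate every interleaving of two tasks with two steps each."""
    for positions in combinations(range(4), 2):
        yield tuple(0 if i in positions else 1 for i in range(4))


outcomes = {}
for schedule in all_schedules():
    outcomes.setdefault(run_schedule(schedule), []).append(schedule)

for value, schedules in sorted(outcomes.items()):
    print(value, len(schedules))
```

Under these assumptions only two of the six interleavings produce the intended result of 2. A test that happens to run one of those two schedules passes without revealing the lost update, and the number of interleavings grows rapidly with the number of tasks and steps.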

Leveson observes that systems will tend to migrate toward states of higher risk with respect to
safe operation [Leveson 2009]. There are three reasons for this trend: (1) the impact of the
operational environment, (2) unintended effects of design changes, and (3) changes in software
development processes, practices, methods, and tools.

One reason for the trend toward higher risk is that the operational environment impacts the safety
of systems. For example, the system may be deployed in new, unintended operational
environments that may introduce new hazards and may violate assumptions made about the operational
environment when it was designed, resulting in unexpected behavior. Similarly, changes to
operational procedures and guidelines, whether as the result of operational budget reductions or as a
work-around to compensate for known and correctable system faults (see Section 2.6), may also
contribute to increased risk.

A second reason for this trend is that modifications to software are design changes, whether they
are the addition of new functionality or corrections to existing code. Compared to hardware,
software experiences rapid design evolution. In particular, the addition of new mission capability and
operational features is a common occurrence, since software can be easily changed. Similarly,
technology upgrades to the computer system impact the embedded application software.
Examples include the introduction of multi-core processors and a migration from deterministic
network protocols in a federated architecture to a publish-subscribe paradigm on top of high-speed
Ethernet with nondeterministic network protocols. Design changes can result in unintended feature
interactions and the violation of assumptions due to paradigm shifts. A study of 15 operating
system releases from 10 vendors shows that failure rates over multiple major releases stay high and
may even increase [Koopman 1999]. The result is a failure density curve across multiple releases,
whose shape is illustrated categorically in Figure 7.


Figure 7: Failure Density Curve Across Multiple Software Releases (axis label: Time)

A third reason for the trend toward higher risk is that change occurs in the processes, practices,
methods, and tools used in the development of embedded software. For example, although Ada
has been shown to be an excellent choice for the development of highly reliable software, today’s
marketplace demands the use of C, C++, and Java. The reason is simple: Ada programming skills are
scarce, while C, C++, and Java programming skills are plentiful. This has led to efforts in teaching
developers safe use of such languages, as in Seacord’s work with C-based languages [Seacord
2008]. Similarly, object-oriented design methods now expressed in Unified Modeling Language
(UML) were not originally developed with safety-critical embedded software systems in mind,
and retrofitting such notations with process standards, such as Motor Industry Software Reliability
Association (MISRA) C and C++, to address real-time, reliability, and safety concerns is an
ongoing, slow process fraught with pitfalls.

In  summary,  there  is  a  need  to  do  the  following: 

•  monitor  leading  indicators  of  increased  risk  in  evolving  software-reliant  systems 

• investigate potential new problem areas and hazards arising from major capability and
technology upgrades

•  revise  the  processes,  practices,  methods,  and  tools  used  to  address  these  risk  areas. 

2.9 Limited Confidence in Modeling and Analysis Results

Model-based engineering is considered key to improving system engineering and embedded
software system engineering. Modeling, analysis, and simulation have been practiced by engineers for
a number of years. For example, design engineers have created computer hardware models in
Very High-Speed Integrated Circuits (VHSIC) Hardware Description Language (VHDL) and
validated them through model checking [VHDL NG 1997]. Control engineers have used modeling
languages such as MATLAB/Simulink to represent the physical characteristics of the system to be
controlled and the behavior of the control system. Characteristics of physical system components,
such as thermal properties, fluid dynamics, and mechanics, have been modeled and simulated at
various levels of fidelity.


Even for software systems analysis, models and simulation have proven useful in predicting
various operational aspects. However, aircraft industry experience has shown that analysis models
maintained independently by different teams result in a multiple truth problem [Feiler 2009a] (see
Figure 8). Such models are created at different times during the development based on an
evolving design document and are rarely kept up-to-date with the evolving system design. The
inconsistency between the analysis models, with respect to the system architecture and the system
implementation, renders the analysis results of little value in the qualification. A need exists for an
architecture-centric reference model approach as a common source for system analysis and system
generation.


[Figure 8 labels: System evolving over time; Inconsistency of independently maintained analysis models]

Figure 8: Pitfalls in Modeling and Analysis of Systems

2.10  Summary 

The current best development practice (relying on process standards, best practices, and safety
culture) is unable to accommodate the exponential increase in the size and interaction complexity
of embedded software in today’s increasingly software-reliant systems. Traditional reliability
engineering has its roots in hardware and assumes slowly evolving system designs with reliability
metrics focusing on physical wear. During system design, reliability improvement is achieved
through modeling and analysis to identify failure modes. In contrast, software failure modes are
primarily design errors, and software is often assumed to be perfectible and to show deterministic
behavior. With current best practices as shown in Figure 3, 70% of errors are introduced during
requirements and system and software design, while 80% of errors are not discovered until
integration and acceptance testing. There is a clear need to reduce the leakage rates of requirements
and design errors into later development phases.

Moving beyond or augmenting the textual specification of requirements is essential to validating
specifications analytically. Similarly, we need architectural models with well-defined semantics
that support analysis of nonfunctional mission and safety-critical requirements. Only through such
models can we understand, early in the life cycle, how such system properties are impacted by

• architectural decisions

• identifying potentially mismatched assumptions in system interactions.

In system safety analysis, we must take into account hazards due to software malfunction. Fault
management has increasingly become the responsibility of the embedded software, requiring
increased scrutiny in order to achieve reliability and safety goals. The complexity and the
nondeterministic nature of software interaction require the use of formal static analysis methods to
increase our confidence in system operation beyond testing. However, analysis results add little
confidence to the testing evidence for system qualification unless consistency across analysis
models is maintained. Relying on operational work-arounds, instead of corrections, to address
software design problems is not a sustainable option and results in reliability degradation over time.

3 A Framework for Reliability Validation and Improvement


In this section, we introduce a framework for reliability validation and improvement of
software-reliant systems. The framework integrates four engineering technologies into a practice that
improves both the development and qualification of such systems by addressing the challenges
outlined in the previous section. Figure 9 illustrates the interplay among four technologies:

1. formalization of mission and safety-criticality requirements at the system and software level

2. architecture-centric, model-based engineering

3. static analysis of mission and safety-criticality-related system properties

4. system and software assurance.

As  Figure  9  illustrates,  the  technologies  interact  in  the  proposed  framework,  as  follows: 

•  Formalization  of  requirements  establishes  a  level  of  confidence  by  assuring  consistency  of  the 
specifications,  as  well  as  their  decomposition  into  subsystem  requirements.  The  requirements 
are  decomposed  in  the  context  of  an  architecture  specification. 

• The architecture design is expressed in a notation with well-defined semantics for the
architectural structure, interaction topology, and dynamics of the embedded software, the computer
system, and the physical mission system, refined into component models with detailed
designs, and translated into an implementation. This set of evolving models is the basis for
virtual integration, that is, the integration of the system through its models. Virtual integration
allows for incremental verification and validation (V&V) of mission-related and
safety-criticality-related system properties through static analysis and simulation. The annotated
architecture model in the model repository is the source of automatically derived analysis
models and auto-generated implementations where possible.

• The application of static analysis, such as formal methods, to requirements, architecture
specifications, detailed designs, and implementations leads to an end-to-end V&V approach.

•  Assurance  cases  provide  a  systematic  way  of  establishing  confidence  in  the  qualification  of  a 
system  and  its  software.  They  do  so  through 

  - recording and tracking the evidence and arguments, as well as context and assumptions,
    that the claims of meeting system requirements are satisfied by the system design and
    implementation, and

  - making the argument that the evidence is sufficient to provide justified confidence in the
    qualification.

The assurance case methodology addresses both evidence regarding the system design and
implementation and evidence regarding the application of V&V methods.

[Figure 9 panels: From System Requirements to Software Requirements (formalized requirements; focus on safety-criticality requirements); Mission Requirements (function, behavior, performance); Safety-Criticality Requirements (reliability, safety, security); System and Software Assurance]

Figure 9: Reliability Validation and Improvement Framework


This framework changes the traditional software development model. The revised development
model for software-reliant systems is shown in Figure 10. This revised model consists of two Vs
reflecting the development process (build the system) and the qualification process (build the
assurance case) for the system. Both are affected by the use of architecture modeling, analysis, and
generation technology. The “build the system” development process covers the life cycle ranging
from formalized requirement specification and architecture design, detailed design, and code
development, through integration, target, and deployment build. The “build the assurance case”
qualification process comprises the traditional unit test, integration test, system test, and acceptance
test phases. In addition, it covers early-life-cycle phases that bring increased justified confidence
in the system, such as requirements validation, system/software architecture V&V, and design
V&V through static analysis and simulation.

Figure 10: Revised System and Software Development Model

We affect the development of software-reliant systems by discovering errors, in particular
system-level errors, earlier in the development life cycle than is done in current practice. This reduces the
leakage of errors to later phases and the need for rework and retest, a major cost driver in today’s
development. In the process, we ensure that all development artifacts (from requirements, to
architecture and design models, to implementations and build and deployment configurations) are
managed in a consistent manner throughout the development life cycle.

We also affect the V&V with the objective of improving the qualification of software-reliant
systems by building the assurance case throughout the life cycle to increase our confidence in the
qualified system. In the process, we ensure that all qualification evidence, ranging from validated
requirements to analyzed and verified models and implementations, is managed in a consistent
manner and evolves in the context of previously validated artifacts.

Finally, we can achieve cost-effective reliability improvement by focusing on high-risk system
concerns, such as system interaction complexity and safety-criticality requirements, as well as
high-payoff areas, namely, system-level problems that currently leak into system integration test
and later phases. We achieve this improvement by using virtual integration of architecture models
with well-defined semantics and performing end-to-end validation of system properties. The
result is a reduction in error leakage to later phases and a major reduction in rework/retest cost.

We proceed by discussing each of the four technologies in terms of their state of the art,
contribution to reliability improvement, and interactions with the other technologies in the framework.
Then we will outline our approach for the proposed reliability improvement metrics.

3.1  Formalized  System  and  Software  Requirements  Specification 

Requirements are typically divided into business requirements, process requirements, and product
requirements. We are focusing on product requirements. Requirements are also divided into
functional requirements (what the system is to do) and nonfunctional requirements on the operation
(performance, safety, security, availability, etc.) and on the design (modifiability, maintainability,
etc.), as well as constraints on the solution (e.g., use of specific technology). A common view for
software engineering has been that only functional requirements can be implemented by software
and that nonfunctional requirements can be addressed only in the context of the system in which
the software is deployed. This separation between system engineering and software engineering
has led to system integration problems [Boehm 2006], in particular for software-reliant systems.

There is a clear need for co-engineering of system and software that spans from requirements to
architecture design, detailed design, and implementation and that uses formal validation [Boehm
2006, Bozzano 2010]. We need to capture the shalls of a system, which tend to focus on
achieving the mission under normal conditions. These are the mission requirements. We also must
capture the shall nots, which describe how the system is expected to perform when things go wrong.
These are the safety-criticality requirements.

Mission requirements address functionality, behavior, and performance under normal conditions.
Safety-criticality requirements address safety, reliability, and security, which often involve
performance under stress or failure conditions (see Figure 11).

Figure 11: Mission and Safety-Criticality Requirements

In Section 2.3, we identified a clear need for improving requirements capture and validation. A
recent industry survey [FAA 2009b] indicates that in DO-178B-compliant practices, requirements
are captured in structured text (shall statements) with traceability to the design and code as
required by practice standards. The challenge is how to formalize the specification of requirements
without overwhelming the stakeholders in a system with the formalisms. We proceed by
summarizing the state of best practice in formalized requirement specification, linking the specification
of requirements to the interactions of the system with its environment, and then discussing a
hazard-focused framework for safety-criticality requirements.

3.1.1  A  Practical  Approach  to  Formalized  Requirements  Specification 

A method to capture requirements known as Software Cost Reduction (SCR) uses a Four
Variable model that relies on monitored and controlled variables on the system side and input and
output variables on the software side to relate system requirements and software requirements
[Parnas 1991]. Miller has proposed an extension to the model that uses tables to represent system state
and event/action relations to specify desired behavior and recommends documentation of
environmental assumptions [Miller 2001]. This tabular form facilitates coverage and consistency
checking [Heitmeyer 1995].
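
As a minimal sketch of what coverage and consistency checking of a tabular specification can mean, consider a hypothetical condition table for one controlled variable (a warning lamp) over two monitored Boolean variables. The table contents, the variable names, and the check_table function are assumptions for illustration; they do not reproduce the notation or the tools of the cited methods.

```python
from itertools import product

# Hypothetical condition table for one controlled variable (a warning lamp).
# Each row pairs a predicate over the monitored variables with the lamp value.
ROWS = [
    (lambda alt_low, gear_down: alt_low and not gear_down, "ON"),
    (lambda alt_low, gear_down: alt_low and gear_down, "OFF"),
    (lambda alt_low, gear_down: not alt_low and gear_down, "OFF"),
]


def check_table(rows):
    """Return (uncovered, conflicting) input combinations for a table."""
    uncovered, conflicting = [], []
    for inputs in product([False, True], repeat=2):
        values = {value for predicate, value in rows if predicate(*inputs)}
        if not values:
            uncovered.append(inputs)      # coverage gap: no row applies
        elif len(values) > 1:
            conflicting.append(inputs)    # inconsistency: rows disagree
    return uncovered, conflicting


uncovered, conflicting = check_table(ROWS)
print("uncovered:", uncovered)
print("conflicting:", conflicting)
```

In this sketch the table says nothing about the case where the altitude is not low and the gear is up, which a prose shall statement could easily leave unnoticed. Brute-force enumeration works here only because the input space is tiny; it is a stand-in for the analyses that dedicated tools perform.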

The Requirements State Machine Language (RSML) method refines the tabular representation
and adds diagrams to improve the representation of state behavior [Leveson 1994]. Intent
Specifications provide an approach to human-centered requirement specification [Leveson 2000]. A
commercial toolset supporting the Intent Specification approach, the Specification Toolkit and
Requirements Methodology (SpecTRM) [Lee 2002], includes a behavioral specification language,
SpecTRM-RL, which is similar to RSML.

Goal-oriented Requirements Engineering (GORE) [Dardenne 1993, Letier 2002, van Lamsweerde
2004b] goes one step further by including goals and constraints in the requirement specification
formalism to complement the event/action model in order to better capture nonfunctional
properties [van Lamsweerde 2000, 2004a]. A toolset called Formal Analysis Using Specification Tools
(FAUST) supports the capture and analysis of GORE specifications [Rifaut 2003].

Graphical design methods that are often already familiar to engineers include representations for
state machines for modeling behavior, such as UML State Charts and Simulink Stateflow.
Engineers often combine these with user scenarios (e.g., expressed graphically in the use case
technique of UML) to express user needs by detailing scenario-driven threads through system
functions with the objective of helping derive a system’s behavioral requirements.

Based on a study of best practice [FAA 2009b], the FAA developed a handbook that provides
practical guidance in formalized requirements capture [FAA 2009a].

A reason for increasing formality in requirement specification is to allow for the validation of
requirements with respect to consistency and constraint satisfaction. We achieve this by using tools that
check for consistency of the specification [Heitmeyer 1995, Lee 2002] and by transforming the
specification into a formalism that allows for analysis by formal methods such as model checkers
and provers for industrial applications [Miller 2010].

3.1.2  Interaction  Between  the  System  and  Its  Environment 

We need to specify not only how a system responds to input, but also how it interacts with its
environment in other ways (as shown in Figure 12). The Association Française d’Ingénierie Système
(AFIS) has defined a process called CPRET that reflects this view; CPRET is “a set of behaviors
by execution of functions to transform input into output utilizing state, respecting
constraints/controls, requiring resources, to meet a defined mission in a given environment” [AFIS
2010]. Typically, requirement specification of systems focuses on the state and behavior of the
system and the input/output conditions. This definition provides a more comprehensive coverage
of system requirements by including external control imposed on the system and resources
required by the system. In other words, a system has four types of interaction points with its
environment: input, output, imposed constraints/control, and resource requirements.


Figure  12:  The  System  and  Its  Environment 

When taking a systems view, we see that the environment itself is a collection of systems. Any
system can be a physical/mechanical system, a computer system, a software system, one or more
human roles, or a combination thereof. The system of interest interacts with one or more systems
in the environment, the combination forming a composite system. The system of interest may
itself be composed of interacting systems to act as a whole.

Different types of system interactions are illustrated in Figure 13. The interactions may be in
terms of (1) cooperating systems, (2) systems that act as resources to the system of interest, (3)
systems that control or constrain the operation of the system of interest, or (4) systems in the
environment that may not directly interact but are affected by the operation of a system. The latter,
although they do not directly interact with the system of interest, may still represent hazards that
affect the ability of the system to achieve its mission by acting as obstacles or by competing for
the same resources.

It is desirable to capture expectations and assumptions about these four interaction points in
requirements. We can typically express them in requirements by taking into account their temporal
aspects. For example, we may have requirements on the rate and the latency of data in a data
stream, on the order of events or commands in an event or command sequence, or on the usage
pattern of resources. Some of these aspects have been incorporated into requirement specification
methods and into the FAA handbook mentioned in Section 3.1.1.
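
A temporal requirement of this kind becomes checkable once its bounds are stated explicitly. The following minimal sketch assumes a hypothetical data stream recorded as (produced_at, consumed_at) times in milliseconds, with an assumed maximum period of 25 ms and an assumed maximum latency of 10 ms. The function name and all numbers are illustrative only.

```python
def check_stream(samples, max_period, max_latency):
    """Check a data stream against hypothetical rate and latency bounds.

    samples is a list of (produced_at, consumed_at) times in milliseconds.
    Returns a list of human-readable violations.
    """
    violations = []
    for index, (produced, consumed) in enumerate(samples):
        latency = consumed - produced
        if latency > max_latency:
            violations.append(f"sample {index}: latency {latency} ms")
        if index > 0:
            period = produced - samples[index - 1][0]
            if period > max_period:
                violations.append(f"sample {index}: period {period} ms")
    return violations


samples = [(0, 4), (20, 23), (40, 52), (75, 78)]
violations = check_stream(samples, max_period=25, max_latency=10)
for violation in violations:
    print(violation)
```

The same two bounds could equally be attached to an architecture model as properties and analyzed statically rather than checked against a recorded trace; the trace check is used here only because it is the simplest form to show.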


Figure 13: The Environment as a Collection of Systems

The purpose of expressing requirements is to specify the behavior expected of a system (shown in
the left-hand image of Figure 14). Users of a system have a set of expected behaviors in mind.
The intent of requirements is to define what is expected, but not all of them may be captured and
discovered in the course of system development. Unexpected behavior is erroneous behavior that
may be tolerated or mitigated. The behavior allowed by a requirement specification, however,
may not encompass all of the expected behavior, and it may even include unexpected behavior (as
shown in the middle image of Figure 14). A system design or implementation exhibits actual
behavior. This behavior may be within the specified behavior (i.e., meet the requirements) or may be
outside the specified behavior (as shown in the right-hand image of Figure 14).


Figure 14: Expected, Specified, and Actual System Behavior

Requirement verification ensures that the actual behavior exhibited by the design or implementation
satisfies the requirement specification. Note that a design or implementation may meet its
requirements, but still exhibit unexpected behavior. Therefore, requirement validation and
verification are equally important.

Actual behavior that is unexpected behavior presents hazards in that it deviates from the user’s
expectations. Similarly, actual behavior beyond specified behavior presents hazards because
interacting systems have been designed against the specification. These hazards, if not addressed,
manifest themselves as error propagations that can result in unexpected behavior.
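One way to read Figure 14 is as relationships between sets of behaviors. The sketch below is illustrative only: it assumes behaviors can be enumerated as labels, and the behavior names and the `classify_hazards` helper are hypothetical rather than taken from this report.

```python
def classify_hazards(expected, specified, actual):
    """Partition behaviors relative to expected and specified behavior."""
    return {
        # Actual behavior the user did not expect: hazard to the user.
        "unexpected": actual - expected,
        # Actual behavior outside the specification: hazard to interacting systems.
        "out_of_spec": actual - specified,
        # Expected behavior the specification failed to capture.
        "unspecified_expectations": expected - specified,
        # Behavior that meets the specification but is still unexpected.
        "verified_but_unexpected": (actual & specified) - expected,
    }


# Hypothetical behavior labels, chosen only for illustration.
expected = {"hold_altitude", "report_fault", "limit_bank_angle"}
specified = {"hold_altitude", "report_fault", "reset_on_overflow"}
actual = {"hold_altitude", "reset_on_overflow", "drop_sensor_frame"}

hazards = classify_hazards(expected, specified, actual)
for kind, behaviors in hazards.items():
    print(kind, sorted(behaviors))
```

In this example `reset_on_overflow` satisfies the specification yet is not expected by the user, which is the case that requirement validation, as opposed to verification, is meant to catch.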

The purpose of safety-criticality requirements is to specify how to deal with the hazards of
unexpected behavior. This is the topic of the next section.

3.1.3 Specifying Safety-Criticality Requirements

In this section, we examine safety engineering approaches and then propose a generalized hazard
analysis framework that can be supported by an architecture-centric model-based engineering
approach based on the SAE AADL [SAE 2004-2012].

Safety-criticality  requirements  address  three  system  properties. 

1. reliability: the ability of a system to continue operating despite component failure or
unexpected behavior by system components or the environment

2. safety: the ability to prevent unplanned and unacceptable loss as a consequence of hazards
(failure, damage, error, accident, injury, death)

3. security: the ability to prevent error propagation and the potential harm that could result from
the loss, inaccuracy, alteration, unavailability, or misuse of the data and resources that a
system uses, controls, and protects.

Satisfying any one of those system properties presents challenges. But all three, together with
real-time performance, must be satisfied at the same time, an achievement that requires
compatibility across the properties and consistency across the approaches supporting their analysis
[Rushby 1994]. However, current practice tends to treat each of these safety-criticality dimensions
separately.

Safety engineering practice standards, such as the FAA Systems Safety Handbook [FAA 2000]
and the Safety Assessment Process on Civil Airborne Systems and Equipment [SAE 1996],
recommend hazard identification through FHA that involves identification of failure modes and their
manifestation in terms of externally observable behavior. This step is typically followed by a
system safety analysis (SSA) and common cause analysis (CCA) for analyzing the effects of hazards
and their potential negative impact. As the system design progresses, fault tree analysis (FTA) and
failure mode and effect analysis (FMEA) are used to aid in

•  understanding  the  impact  of  faults  and  hazards  within  the  system 

•  making  architectural  decisions  to  prevent  or  manage  faults  and  hazards  in  order  to  satisfy  the 
reliability  or  safety  requirements. 

These hazards are then translated into safety requirements on the system or its components. This
translation involves specifying additional constraints on the state and behavior of the system and
its interactions with the environment. These constraints must be satisfied in order to prevent the
hazard from propagating or to determine whether it is an acceptable risk to propagate the hazard.
The progression from hazard analysis to requirement specification has been demonstrated by
Leveson and by Miller. Leveson’s team integrated the hazard analysis method called “STAMP to
Prevent Accidents” (STPA), which draws on STAMP [Leveson 2009], with intent specification
via the Specification Tools and Requirements Management (SpecTRM) toolset and SpecTRM-RL
modeling language [Herring 2007, Owens 2007]. Miller and associates integrated FHA
and FTA with a state-based safety requirement specification that can drive V&V through model
checking and proof engines [Tribble 2002, Miller 2005a, 2005b].

3.1.4 An Error Propagation Framework for Safety-Criticality Requirements

Figure 15 illustrates an error propagation framework that provides a unified view of safety,
reliability, and security faults and hazards. This framework is reflected in the Error Model Annex of
the AADL standard, providing a single architecture-centric model source for driving the analysis
and validation of all three safety-criticality requirements (see Section 3.2.1). It consists of the
ability to specify

•  sources  of  errors,  such  as  faults,  expressed  as  error  events 

•  the  error  behavior  of  systems  and  system  components  in  terms  of  error  states  and  transitions 
to  represent  error-free  and  failure  behavior 

• the ability of other systems or system components to observe the failures of a system or
system component in the form of error propagations.

Error propagations represent failures of a component that potentially impact other system
components or the environment through interactions, such as communication of bad data, no data, and
early/late data. Such failures represent hazards that, if not handled properly, can result in
potential damage. The interaction topology and hierarchical structure of the system architecture
model provide the information on how these error behavior state machines interact through error
propagation.

The concepts of expected and unexpected behavior illustrated in Figure 14 relate to the concepts
of error events, error states, and error propagations as follows. Unexpected behavior can be due to
a fault in the system or system component, which is expressed as an error event. Unexpected
behavior can also be due to error propagation from interacting systems or the environment. A fault in
the system can be a requirement fault such as an incomplete or omitted requirement, a design fault
such as an error in the fault management logic, or an implementation fault such as a coding error
in handling measurement units. The fact that a system or component has unexpected behavior is
represented by the component entering an error state representing failure. An error state represents
a hazard that, if not addressed locally, results in error propagation, which impacts interacting
system components or the environment.
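The relationship between error events, error states, and error propagations can be sketched as a small state machine per component. This is a minimal illustration under stated assumptions, not the Error Model Annex notation: the `Component` class, its field names, and the sensor/controller/monitor example are hypothetical, and the sketch assumes an acyclic interaction topology.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    state: str = "operational"  # error-free state
    # (current state, error event or incoming propagation) -> next error state
    transitions: dict = field(default_factory=dict)
    # error state -> propagation emitted when the state is not handled locally
    out_propagations: dict = field(default_factory=dict)
    # components that observe this component's outgoing propagations
    listeners: list = field(default_factory=list)

    def trigger(self, event):
        """Apply an event and return the resulting (component, propagation) trace."""
        next_state = self.transitions.get((self.state, event))
        if next_state is None:
            return []  # no transition defined: the event is masked here
        self.state = next_state
        emitted = self.out_propagations.get(next_state)
        if emitted is None:
            return []  # error state is contained locally
        trace = [(self.name, emitted)]
        for listener in self.listeners:
            trace += listener.trigger(emitted)
        return trace


sensor = Component(
    "sensor",
    transitions={("operational", "stuck_reading"): "failed"},
    out_propagations={"failed": "bad_value"},
)
controller = Component(
    "controller",
    transitions={("operational", "bad_value"): "failed"},
    out_propagations={"failed": "omission"},
)
monitor = Component("monitor")  # defines no transitions, so it masks everything
sensor.listeners.append(controller)
controller.listeners.append(monitor)

trace = sensor.trigger("stuck_reading")
print(trace)
```

The fault in the sensor (an error event) drives it into a failure state, which is observed by the controller as an incoming bad-value propagation and in turn causes an outgoing omission.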

Figure 15 illustrates the distinction between two kinds of error propagation. One kind is
propagation of errors over modeled interaction channels such as port-based communication. For example,
an actuator may send the wrong voltage to a motor over an electrical wire, or a software controller
may send a control signal to an actuator too late. This kind of error propagation is shown as
unexpected interaction over specified interaction channels. The second kind of error propagation
occurs between components without explicitly modeled interaction channels. For example, the
increasing heat of a motor in use may raise the temperature of a nearby processor, causing it to run
slower; or a piece of software may exceed an array bound, overwriting data of other software. This
kind of error propagation is shown as unintended interactions. Both forms of error propagation
represent potential safety and security hazards.

STAMP, Systems Theory Accident Model and Processes, is an accident model based on systems theory.

Figure 15: Capturing Safety-Criticality Requirements

A  safety  hazard  is  an  error  propagation  from  the  system  to  the  environment  that  can  cause  damage 
to  a  system  in  the  environment.  A  safety  hazard  also  exists  if  the  system  shows  expected  behavior, 
but  an  interacting  system  in  the  environment  shows  unexpected  behavior  (e.g.,  an  operator  sticks 
his  finger  into  a  fan).  It  can  also  be  due  to  a  change  in  the  environment  that  results  in  behavior 
different  from  that  assumed  by  the  system  (e.g.,  a  software  controller  assumes  metric  units  for 
sensor  data  readings  while  the  new  sensor  provides  them  in  imperial  units). 

Figure 12 in Section 3.1.2 shows four classes of specified interaction points: input, output,
control, and resource. We have incoming hazards (expressed as error propagations) through inputs
and outgoing hazards through outputs. In the case of interaction with a resource, incoming
hazards are related to available resource capacity, while outgoing hazards are related to resource
demand by the system. In the case of the control interaction point, the hazards are related to
outgoing observations (e.g., sensor output) and incoming control actions (e.g., actuator input). These
hazards, if not mitigated locally, become error propagations to be addressed by the recipients.
These error propagations can have the following characteristics:

• Commission: providing output (data, events, signals, hydraulic pressure, etc.), providing
control status, or requesting resources when not expected propagates outgoing error; receiving
input, receiving control input, or being allocated resources when not expected propagates
incoming error.

• Omission: not providing output or not receiving input when expected (and equivalent for
control and resource) propagates error.

• Bad value: inaccuracy, value out of range, imprecision, and mismatched unit are examples of
bad value.

• Early/late: for time-sensitive interactions, the input, output, control, or resource may be
provided too early or too late.

• Too long/too short: for time-sensitive interactions with duration, input, output, control, or
resource may be available for too short or too long a time.

• For interactions that are streams or sequences, these characteristics may apply: wrong rate,
variation in time interval, variation in value difference, missing element, wrong order.

We  can  attach  these  error  propagation  types  to  the  appropriate  interaction  point  in  the  system 
specification  together  with 

•  the  condition  that  must  be  violated  for  the  propagation  to  occur  and 

•  an  indication  of  whether  the  error  propagation  must  be  prevented  by  the  outgoing  interaction 
point  or  is  expected  to  be  tolerated  or  handled  by  the  recipient. 

In other words, we are recording fault prevention and mitigation requirements that must be
fulfilled by the fault management architecture of the system or the tolerance of hazards that is
expected of the environment.
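Attaching error propagation types to interaction points can be pictured as a small record per propagation. The following is a sketch under stated assumptions: the `ErrorType`, `Handling`, and `PropagationSpec` names, the port names, and the example conditions are all hypothetical and are not AADL syntax.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    COMMISSION = "commission"
    OMISSION = "omission"
    BAD_VALUE = "bad value"
    EARLY_LATE = "early/late"
    TOO_LONG_SHORT = "too long/too short"
    STREAM = "stream or sequence error"


class Handling(Enum):
    PREVENTED_BY_SOURCE = "must be prevented by the outgoing interaction point"
    TOLERATED_BY_RECIPIENT = "expected to be tolerated or handled by the recipient"


@dataclass(frozen=True)
class PropagationSpec:
    interaction_point: str    # hypothetical port or resource name
    error_type: ErrorType
    violated_condition: str   # condition whose violation causes the propagation
    handling: Handling


specs = [
    PropagationSpec("airspeed_out", ErrorType.BAD_VALUE,
                    "0 <= airspeed <= 400 knots", Handling.PREVENTED_BY_SOURCE),
    PropagationSpec("airspeed_out", ErrorType.EARLY_LATE,
                    "sample delivered within 20 ms", Handling.TOLERATED_BY_RECIPIENT),
    PropagationSpec("cpu_budget", ErrorType.COMMISSION,
                    "execution time <= 5 ms per period", Handling.PREVENTED_BY_SOURCE),
]


def prevention_requirements(specs):
    """Propagations the system's own fault management must prevent."""
    return [s for s in specs if s.handling is Handling.PREVENTED_BY_SOURCE]


def tolerance_assumptions(specs):
    """Propagations the environment is expected to tolerate."""
    return [s for s in specs if s.handling is Handling.TOLERATED_BY_RECIPIENT]


for s in prevention_requirements(specs):
    print(s.interaction_point, s.error_type.value, "->", s.violated_condition)
```

Splitting the records this way separates the mitigation requirements placed on the system from the tolerance assumptions placed on its environment.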

Unintended error propagations occur because the system and its environment are not guaranteed
to be isolated from each other (e.g., they may share resources). Examples of shared resources are
processor, memory, and network for software, and physical location, electrical power, and fuel
for physical hardware.

It is desirable to limit such unintended error propagations through enforceable isolation
mechanisms, such as

•  radiation  hardening  of  hardware 

•  limits  on  electrical  power  consumption  through  fuses 

•  enforcement  of  value  limits  through  filters 

•  runtime  enforcement  of  address  spaces  or  execution  time  budgets  for  software. 
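As a small illustration of one mechanism in this list, enforcement of value limits through a filter, the sketch below replaces out-of-range samples with an explicit "no data" marker. The function name, the limits, and the choice to turn a bad value into an omission are assumptions made for illustration.

```python
def enforce_value_limits(samples, low, high):
    """Pass in-range samples; replace out-of-range ones with None (an explicit omission)."""
    return [s if low <= s <= high else None for s in samples]


# Hypothetical sensor readings; 48.0 and -3.2 violate the assumed limits.
readings = [4.9, 5.1, 48.0, 5.0, -3.2]
filtered = enforce_value_limits(readings, low=0.0, high=12.0)
print(filtered)  # [4.9, 5.1, None, 5.0, None]
```

The filter does not remove the fault, but it converts an arbitrary bad value into a known propagation type (omission) at a known interaction point, which the recipient can be designed to handle.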

Using isolation mechanisms can greatly reduce the types of error propagations we have to deal
with. For example, we can place software subsystems in separate operating system processes with
virtual memory that enforces address space boundaries at runtime. This causes all faults within the
software subsystem (for example, division by zero, index out of bounds, or computational errors) to
manifest themselves as error propagations of one of the above types through the known interaction
points. Such isolation may not always be feasible unless we have an understanding of the
architecture of the systems in the environment that interact with the system of interest.

Security hazards are typically viewed as intentional propagation of errors into a system. They take
advantage of a system’s not detecting error propagation or of faults within the system that can
cause the system to enter unexpected behavior despite expected behavior by the environment.
Unexpected behavior of the system can also result in error propagation that can cause a security
problem in the environment. In other words, security has to be managed at the architectural level
since it is driven by the interaction between systems.

Reliability requirements focus on the ability of the system to continue to operate despite failure of
system components or external hazards that cause system components or the system as a whole to fail.
Component failures are addressed through fault management strategies such as redundancy within the
system or by a composite system that includes the system of interest. Hazard-induced failures are also
addressed through fault detection, isolation, and recovery (FDIR) through the enclosing composite
architecture. Architectural solutions to fault and hazard management are discussed in the next section.

3.2 Architecture-Centric, Model-Based Engineering

Modeling, simulation, and analysis are essential parts of any engineering discipline. It is
important that models have well-defined semantics in order for their simulation and analysis to
produce credible results. Modeling notations for describing a system or software present the
challenge of representing a cohesive and consistent whole when expressing different aspects of the
system architecture and detailed design in different diagrams [UML OMG 2009]. Similarly,
analysis models that are maintained independently by different teams present the challenge of
producing credible results by maintaining consistency with system architecture, other analysis models,
and the system implementation (see Section 2.9).

Therefore, it is essential that the model-based engineering approach is architecture-centric and
uses as its foundation a standardized underlying meta model with well-defined semantics that can
drive the analysis and generation of a system. We proceed by discussing such an
architecture-centric model-based engineering approach based on an international industry standard. Then we
elaborate on the relationship between requirements and architecture, focus on safety-critical
system architectures, and discuss the practicality of an industrial approach for architecture-centric
analysis and construction of software-reliant systems.


Figure 16: Collage of UML Diagrams

[Wikipedia 2011]

3.2.1 An Industry Standard-Based Approach to Architecture-Centric, Model-Based Engineering


The SAE Architecture Analysis & Design Language (AADL) was developed specifically for
modeling and analyzing the architecture of embedded real-time systems (i.e., safety-critical
software-reliant systems in support of model-based engineering) [SAE 2004-2012, Feiler 2012].
AADL focuses on modeling the system architecture in terms of the software architecture,
computer system architecture, and the physical system, as well as the interactions among the three, as
illustrated in Figure 17. It supports component-based modeling through a textual and graphical
notation, as well as a standardized meta model with well-defined semantics for the language
concepts that provides a standardized Extensible Markup Language (XML) Metadata Interchange (XMI)
format for exchange of AADL models and for interfacing AADL models with analysis tools.


[Figure 17 labels: Embedded Application Software (Flight Control & Mission, Guidance, Control); The Software (focus on software runtime architecture); Deployed on; The Computer System (Computer System Hardware & OS; physical connectivity)]

Figure 17: Software-Reliant System Interactions Addressed by AADL

AADL defines component concepts specific to representing the software architecture (process,
thread, subprogram, data), the computer system (processor, memory, bus, virtual processor,
virtual bus), the physical system (device), as well as their aggregation (subsystem or system). It
includes concepts of representing component interactions both at the logical level (connections for
sampled data ports, and queued event and message ports, shared data access, and service
request/call-return) and the physical level (connectivity via bus access), end-to-end flow
specification including flow abstraction in component interfaces, operational modes to characterize
dynamic changes to the architecture such as reconfigurations, and specification of deployment
binding of software to hardware.

AADL supports modeling of large-scale systems and families of systems in several ways. It
supports component-based modeling with separation of component interface specification
(component type) and multiple variants of implementation blueprints (component implementation) for
each interface specification. AADL supports incomplete component specifications that can be
refined, including abstract components and parameterized component type and implementation
templates. Finally, AADL provides a package concept similar to those found in programming
languages to support the organization of the component specifications into a hierarchy of packages.
AADL has a set of predefined properties. Its core language is extensible through a notation for
introducing user-defined properties and an annex sublanguage mechanism to introduce additional
concepts that can be associated with an AADL model [SAE 2004-2012].

AADL is a notation with strong typing, whose value to building reliable software systems has
previously been demonstrated by Ada and VHDL. The AADL compiler ensures model
completeness and consistency: threads can only be contained in thread groups and processes; buses
cannot have ports; the data types of connected ports must match; and sampling ports have only one
incoming connection per mode. The AADL standard defines the timing semantics of thread execution and
communication, including deterministic sampling by periodic threads, as well as mode transitions.
This level of specification leaves little room for misinterpretation of the intended execution
behavior and allows for generation of analytical models such as timing models for scheduling
analysis.
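The four consistency rules named above can be illustrated with a small checker over a simplified model. This is a sketch only: the dictionary encoding, the `check_model` function, and the example component names are hypothetical and do not reflect the AADL meta model or any AADL tool interface.

```python
# Hypothetical, simplified model encoding; not the AADL meta model.
ALLOWED_PARENTS = {"thread": {"thread group", "process"}}


def check_model(components, ports, connections):
    """Return a list of violations of four example legality rules.

    components:  name -> (category, parent name or None)
    ports:       (component, port) -> (port kind, data type)
    connections: list of (source port key, destination port key, mode)
    """
    errors = []
    for name, (category, parent) in components.items():
        allowed = ALLOWED_PARENTS.get(category)
        parent_category = components[parent][0] if parent else None
        if allowed is not None and parent_category not in allowed:
            errors.append(f"{name}: {category} may not be contained in {parent_category}")
    for component, port in ports:
        if components[component][0] == "bus":
            errors.append(f"{component}.{port}: a bus cannot have ports")
    incoming = {}
    for src, dst, mode in connections:
        if ports[src][1] != ports[dst][1]:
            errors.append(f"{src} -> {dst}: data types differ")
        if ports[dst][0] == "data":  # sampled data port
            incoming[(dst, mode)] = incoming.get((dst, mode), 0) + 1
    for (dst, mode), count in incoming.items():
        if count > 1:
            errors.append(f"{dst}: {count} incoming connections in mode {mode}")
    return errors


components = {
    "nav_process": ("process", None),
    "nav_thread": ("thread", "nav_process"),
    "gps_thread": ("thread", "nav_process"),
    "stray_thread": ("thread", "cpu"),
    "cpu": ("processor", None),
    "can_bus": ("bus", None),
}
ports = {
    ("gps_thread", "position_out"): ("data", "Position"),
    ("stray_thread", "speed_out"): ("data", "Speed"),
    ("nav_thread", "position_in"): ("data", "Position"),
    ("can_bus", "tap"): ("data", "Position"),
}
connections = [
    (("gps_thread", "position_out"), ("nav_thread", "position_in"), "cruise"),
    (("stray_thread", "speed_out"), ("nav_thread", "position_in"), "cruise"),
]

violations = check_model(components, ports, connections)
for violation in violations:
    print(violation)
```

The example model deliberately breaks each rule once, so the checker reports four violations.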


The extensibility of AADL through property sets and annex sublanguages, whose meta model is
semantically consistent with the core AADL, allows AADL to support analysis along multiple
quality attribute dimensions from the same model source. This is illustrated in Figure 18 and has
been demonstrated in the context of an open source AADL tool environment (OSATE), which is
based on Eclipse.

Figure 18: Multidimensional Analysis, Simulation, and Generation from AADL Models


More than 40 research groups and advanced development teams have integrated their formal
analysis and generation frameworks into AADL as evidenced by reports in over 250
publications. In this section, we provide a sampling of that work; in Section 3.2.4, we discuss industrial
initiatives that have integrated and expanded a range of analysis and generation technologies into
AADL to advance architecture-centric, model-based engineering.


For more information, see the ongoing Ada Europe Conference series examining reliable software technologies
(http://www.ada-europe.org/).

For more information on OSATE, visit http://www.aadl.info/aadl/currentsite/tool/osate.html. For information on
Eclipse, visit http://www.eclipse.org/.

The public AADL wiki maintains an annotated list of publications related to AADL
(https://wiki.sei.cmu.edu/aadl/index.php/AADL_Related_Publications).

Below are examples of research and advanced development efforts where formal analysis and
generation frameworks are integrated into AADL:

•  Resource  allocation  and  scheduling  analysis  is  supported  by  mapping  AADL  models  into 
timed  Petri  nets,  process  algebras  (VERSA)  [Sokolsky  2009],  timed  automata  (Cheddar) 
[Singhoff  2009],  rate  monotonic  analysis  (RMA),  and  constraint-based  resource  allocation  by 
binpacking  [DeNiz  2008]. 

•  Flow  latency  analysis  is  supported  through  component  input/output  flow  specifications  and 
end-to-end  flows  using  latency  properties  and  interpreting  the  execution  and  communication 
timing  semantics  of  partitions  and  threads,  as  well  as  the  hardware  they  are  deployed  on 
[Feiler  2008]. 

• Security analysis with respect to confidentiality is supported by mapping the Bell-LaPadula
security analysis model into AADL through a set of security properties, with the analysis
supported through an OSATE plug-in. This work was later extended to support security analysis
in the context of the Multiple Independent Levels of Security (MILS) architecture [Hansson
2009, Delange 2010a].

• Resource consumption analysis is supported for computer resources such as processor
cycles, memory, and network bandwidth, as well as physical resources such as electrical power
through resource capacity and budget properties [Feiler 2009a].

• Data quality analysis is supported through additional properties of the data content, its
representation in variable base types, and in mapping of data into different protocols [Feiler
2009a].

• An Error Model Annex standard [SAE 2006] was added to the AADL standard suite based
on fault concepts introduced by Laprie [Laprie 1995]. The annex supports the annotation of
AADL models with fault sources, error behavior of components and systems, error
propagation between system components, and mappings between the error behavior in the error
model and the fault management architecture in the core AADL model. The annotated model has
become the source for various forms of reliability and safety analysis ranging from reliability
predictions such as Mean Time To Failure (MTTF) to FHA, fault impact analysis such as FMEA, and
FTA [Rugina 2008, Joshi 2007, Bozzano 2009].

• A Behavior Annex standard [SAE 2011] has been added to the AADL standard that
supports the annotation of behavioral specifications beyond the architectural dynamics expressed
by AADL modes. The Behavior Annex focuses on specification of component interaction
semantics and individual component semantics. Annotated AADL models have thus become
useful as a source for temporal reasoning [Berthomieu 2010, Bozzano 2010].

• The ARINC653 Annex standard [SAE 2011] combined with security and safety properties
has been used to validate safety and security properties in partitioned architectures [Delange
2009], and implementations of validated ARINC653 architectures have been generated,
including ARINC653-standard-compliant, XML-based configuration files [Delange 2010a].

• AADL has been integrated with detailed design specifications expressed in Esterel Safety
Critical Application Development Environment (SCADE), using the SCADE-Simulink Gateway, and
combined with data models expressed in Abstract Syntax Notation One (ASN.1) to create system
models, analyze them, and generate implementations [Perrotin 2010, Raghav 2010].
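As one concrete instance of the scheduling analyses listed above, the classic rate monotonic utilization bound test can be written in a few lines. This sketch is not drawn from any of the cited tools; it assumes independent periodic tasks with deadlines equal to their periods and fixed priorities assigned by rate, and the task set shown is hypothetical. The test is sufficient but not necessary: a task set that exceeds the bound may still be schedulable.

```python
def rma_utilization_test(tasks):
    """Apply the utilization bound U <= n * (2**(1/n) - 1).

    tasks: list of (worst-case execution time, period) pairs in the same time unit.
    Returns (utilization, bound, passes_sufficient_test).
    """
    if not tasks:
        raise ValueError("at least one task is required")
    n = len(tasks)
    utilization = sum(execution / period for execution, period in tasks)
    bound = n * (2 ** (1 / n) - 1)
    return utilization, bound, utilization <= bound


# Hypothetical task set: (execution time in ms, period in ms)
tasks = [(1, 4), (1, 5), (2, 10)]
utilization, bound, schedulable = rma_utilization_test(tasks)
print(f"U = {utilization:.3f}, bound = {bound:.3f}, passes = {schedulable}")
```

In an architecture model, the execution times and periods would come from thread properties and the deployment binding rather than being entered by hand, which is what allows the analysis to be regenerated when the architecture changes.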

In  summary,  the  value  of  an  architecture  modeling  notation  with  well-defined  semantics  that  is 
extensible  as  the  source  of  multiple  analysis  dimensions,  simulation,  and  code  generation  has  been 
recognized  by  the  research  community,  evidenced  in  the  analysis  frameworks  integrated  with 
AADL,  and  by  industry,  evidenced  by  the  range  of  pilot  projects  (some  of  which  are  discussed  in 
Section  3.2.4). 

3.2.2  Requirements  and  Architectures 

As  discussed  in  Section  3.1,  there  is  a  need  to  associate  requirements  with  a  system  architecture 
for  two  reasons: 

1. The system interacts with a set of systems in its deployment environment, representing a
composite system.

2.  The  system  requirements  must  be  satisfied  by  a  system  design. 

The system architecture, as a result, imposes requirements on the subsystems. Behavioral
specifications in terms of state machines have been decomposed into hierarchical state machines and
analytically validated [Heimdahl 1996]. More recently, formalized requirement specifications
have been associated with architecture models (e.g., requirements expressed with GORE were
associated with AADL models to drive several analyses [Delehaye 2009]). The same
requirements formalism has been explored as a basis for system safety as an emergent property in
composite systems to validate requirements coverage and identify possible gaps by validating them
against simulations of the system [Black 2009].

Many requirements can be mapped into constraints on discrete system states, typically expressed
in terms of state machines, as well as continuous value states common in physical systems,
typically expressed as continuous time functions such as differential equations. It is desirable to
associate both types of constraints with components in a system architecture, as done through state
charts and parametrics in SysML [SysML.org 2010]. The same is achievable with AADL, and
proposals have been made to the AADL standards committee to provide a standardized constraint
language annex. Furthermore, it is desirable to support the development of safety requirements
through the ability to represent hazards in the context of a system model (see Section 3.1.4),
which is supported by the AADL Error Model Annex standard.

As discussed in Section 3.1.2, requirements can be associated with a system (component) itself if
they concern the system state or behavior, or with one of its interaction points in terms of
input/output, resource use, or external control. Once thus related, these requirements are associated
with an architecture model on which we can perform checks on two types of consistency:

1. consistency between requirements of subcomponents and the requirements of the containing
system

2.  consistency  between  requirements  that  interacting  system  components  place  on  each  other 
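Both types of consistency check can be illustrated with simple numeric requirements. The sketch below is an assumption-laden example rather than a method from this report: it assumes a sequential pipeline whose stage latencies add up, and a rate requirement expressed as a (minimum, maximum) range; all names and numbers are hypothetical.

```python
def consistent_with_container(container_limit_ms, stage_budgets_ms):
    """Type 1: subcomponent latency budgets must fit within the containing system's limit."""
    total = sum(stage_budgets_ms.values())
    return total <= container_limit_ms, total


def consistent_between_peers(guaranteed_hz, required_hz):
    """Type 2: the rate range a sender guarantees must lie within what the receiver requires."""
    guaranteed_min, guaranteed_max = guaranteed_hz
    required_min, required_max = required_hz
    return required_min <= guaranteed_min and guaranteed_max <= required_max


# Hypothetical end-to-end latency requirement of 50 ms on the containing system.
stage_budgets_ms = {"sensor": 10, "filter": 15, "controller": 20, "actuator": 10}
fits, total = consistent_with_container(50, stage_budgets_ms)
print(fits, total)

# Hypothetical sender guaranteeing 48-52 Hz to a receiver that accepts 45-55 Hz.
peers_ok = consistent_between_peers((48.0, 52.0), (45.0, 55.0))
print(peers_ok)
```

Here the subcomponent budgets sum to 55 ms and therefore conflict with the 50 ms requirement of the containing system, while the interacting components agree on the data rate.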

In the next section, we discuss how architecture patterns can be used to address safety-criticality
requirements.

3.2.3 Safety-Critical, Software-Reliant System Architectures


Safety-critical, software-reliant systems have two elements to their architecture: (1) the nominal
mission (application) architecture with its fault and hazard potential and (2) the fault management
architecture that addresses reliability, safety, and security concerns by mitigating error
propagations.

In Sections 3.1.2 and 3.1.4, we outlined an approach to model the faults, hazards, and safety
requirements of a system in terms of its external interface and observable states and behavior. Here
we are examining

•  the  interactions  between  system  components  in  terms  of  faults  and  hazards 

• how faults and hazards within the composite system affect its realization as a set of
interacting components.

The simplest form of interaction is a flow pattern (pipeline) of system components that process a
data, event, or message stream in multiple steps. In this case, the faults and hazards associated
with the outgoing interaction point of one component (expressed as outgoing error propagations in
the AADL model) must be consistent with the incoming error propagations specified by the
recipient. They must be in agreement as to which hazards and faults are propagated and which ones are
expected to be prevented (masked) by the sender. In addition, the protocol used in the interaction
(e.g., a locking protocol for shared data or a communication protocol for port connections) may
have its own fault and hazard potential and contribute to the potential error propagations to the
recipient.
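The agreement required between sender and recipient can be checked mechanically once each side declares its error propagations. The following sketch assumes a simplified encoding in which the sender marks each error type as "propagated" or "masked"; the `check_connection` function and the example declarations are hypothetical and are not Error Model Annex syntax.

```python
def check_connection(sender_out, protocol_out, recipient_in):
    """Return error types that can reach the recipient but that it does not expect.

    sender_out:   error type -> "propagated" or "masked" (as declared by the sender)
    protocol_out: error types contributed by the interaction protocol itself
    recipient_in: error types the recipient declares as incoming propagations
    """
    propagated = {error for error, handling in sender_out.items() if handling == "propagated"}
    arriving = propagated | set(protocol_out)
    return sorted(arriving - set(recipient_in))


sender_out = {"omission": "propagated", "bad value": "masked", "late": "propagated"}
protocol_out = {"wrong order"}              # e.g., contributed by the communication protocol
recipient_in = {"omission", "wrong order"}  # what the recipient is prepared to handle

unhandled = check_connection(sender_out, protocol_out, recipient_in)
print(unhandled)
```

In this example the sender may deliver data late, but the recipient has not declared late arrival as an incoming propagation, so the two specifications disagree and one of them must change.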

When the components in the pipeline use resources, as is the case for software, we have another
set of error propagation paths based on the assignment of the resources to the system components.
In the case of software components, this set of paths is expressed by the deployment binding of
the software to the hardware, as Figure 19 illustrates. For example, a processor failure may result
in an omission hazard that is propagated to the software component, which, in turn, results in an
outgoing omission hazard. Similarly, a failure in the network hardware propagates as an omission
hazard to the connection, which, in turn, results in an omission hazard of the communicated
information. The AADL Error Model Annex standard defines the possible, explicitly modeled, error
propagation paths in an AADL model and provides a mechanism to express the effect of an
incoming hazard (error propagation) on the error behavior of the component itself (expressed as a
transition in the error state machine) and on the outgoing error propagation (expressed as an outgoing
propagation guard).
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Figure  19:  Error  Propagation  Across  Software  and  Hardware  Components 
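
The propagation paths introduced by bindings and connections can be sketched as a directed graph that is searched from the failing element. The fragment below is a minimal sketch: the element names and the `PATHS` table are hypothetical, and it assumes that no element masks the incoming omission hazard, whereas an Error Model Annex specification could state otherwise.

```python
from collections import deque

# Hypothetical propagation paths (source -> targets), loosely following the
# kinds of paths named in the text: bindings and port connections.
PATHS = {
    "processor": ["sensor_sw"],   # software bound to the processor
    "network": ["conn"],          # connection bound to the network hardware
    "sensor_sw": ["conn"],        # outgoing port feeds the connection
    "conn": ["control_sw"],       # connection delivers to the recipient
    "control_sw": [],
}


def affected_by(origin, paths):
    """Elements an unmasked omission at `origin` can reach, in breadth-first order."""
    seen, order, queue = {origin}, [], deque([origin])
    while queue:
        node = queue.popleft()
        for nxt in paths.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


print(affected_by("processor", PATHS))
print(affected_by("network", PATHS))
```

A processor failure reaches the bound software, the connection, and the recipient, while a network failure reaches only the connection and the recipient.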


We can identify three other application architecture interaction patterns: (1) two-way, peer-to-peer
cooperation, (2) feedback control interaction, and (3) multi-layered service interactions. The
cooperation pattern is shown in Figure 20. In this case, the two parties interact in a two-way exchange
of data, events, or messages (e.g., in the form of an application-level handshaking or synchronization
protocol). Each system has a model of the expected behavior and current state of the other system
and acts accordingly. An example of a new potential hazard area is that of a deadlock hazard
when the model of the other system’s state does not correspond to the actual system state.


Figure  20:  Peer-to-Peer  Cooperation  Pattern 


Leveson uses a feedback control pattern exclusively to analyze the safety of systems [Leveson
2009]. This pattern is shown in Figure 21 with a controller and a controlled system component as
well as sensor and actuator components that connect the two. The figure also shows various hazards
that can potentially be propagated between the components and faults (labeled with circled
numbers) that manifest themselves in potential hazard propagations.

Figure 21: Feedback Control Pattern

[Leveson  2009] 

Figure 22 depicts a system instance with a combination of these patterns where there is a flow on
the left, peer-to-peer cooperation on the right, internal feedback control of a resource on the
bottom left, an external feedback control on the top center, and multiple service layers in terms of
resource usage. We will use this figure to discuss the relationship between the faults and hazard
propagations within a system and their manifestation as potential hazard propagations at the
system level. Faults within system components manifest themselves as propagated hazards, and only
those components that interact with the system’s interaction points affect outgoing hazards and
are affected by incoming hazards.

Figure 22: Multiple Interaction Patterns and Composite System

As discussed in Section 3.1.4, a system or system component can act as an isolation boundary for
propagation of certain faults and hazards. In this case, we can exclude them from consideration as
resulting in unintended error propagation. In the case of physical or computer systems, this is
typically achieved through physical separation. For example, the system might be a box with a
known set of connection sockets; heat transfer could be an unintended interaction if airflow
through a vent is not explicitly represented. In the case of software, partitioning in terms of space
and time is a key concept for providing isolation and reducing the interaction complexity of fault
and hazard propagation (see the work of Rushby [Rushby 1999]). Partitioning allows us to map a
wide range of software design and coding faults (defects) into their manifestation as a much
smaller set of propagated hazards through explicitly specified interaction points. For example,
logical design errors in the decision logic or algorithm, as well as coding errors such as array
index out of bounds, can be mapped into externally observed error propagations in the form of no
output, bad output, early/late output, and so on.

One pattern of fault/hazard isolation uses partitioning. Other fault management patterns are
redundancy patterns and monitoring and recovery patterns. Examples of redundancy patterns are
replication patterns (dual, triple, quad) with identical copies or dissimilar instances. Often
identical copies of software are replicated across redundant hardware to accommodate hardware
redundancy. Sometimes software redundancy is addressed through independent software designs
(N-Version programming). Other redundancy patterns take the form of a recovery pattern allowing
for dissimilarity, such as the Simplex architecture [Sha 2009]. This pattern supports tolerance
of software faults in software upgrade scenarios, such as upgrading the controller software in a
software-fault-resilient manner. Monitoring and recovery patterns represent the interactions
between the health monitoring/fault management system and the application system, with observation
interactions and control interactions, as well as fault management decision logic.

These fault management patterns have hazards in their own right (see Section 2.7). For example,
partitions introduce timing-related hazards due to the virtualization of time (input is not sampled
at the periodic frame start time, but rather at the start time of the partition slot within the frame)
and changes in latency (see Section 2.4). Similarly, a replicated identical software pattern to
support hardware redundancy has the hazard of a collocated deployment binding. This means that,
although the software is replicated, both copies of the software may be deployed on the same
hardware and thus fail to provide redundant service if the hardware fails. Finally, the monitoring
and recovery pattern has a number of hazards related to the fault management mode logic [Miller
2005a].
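
The collocated deployment binding hazard lends itself to a simple structural check on the model. The sketch below assumes that the deployment binding is available as a mapping from software replica to processor; the function name `collocated_replicas` and all component names are hypothetical.

```python
def collocated_replicas(replica_groups, binding):
    """Return, per redundancy group, the processors that host more than one replica."""
    findings = {}
    for group, replicas in replica_groups.items():
        by_processor = {}
        for replica in replicas:
            by_processor.setdefault(binding[replica], []).append(replica)
        shared = {cpu: hosted for cpu, hosted in by_processor.items() if len(hosted) > 1}
        if shared:
            findings[group] = shared
    return findings


# Hypothetical dual-redundant flight guidance software and its deployment binding.
groups = {"flight_guidance": ["fg_a", "fg_b"], "display": ["disp_a", "disp_b"]}
binding = {"fg_a": "cpu1", "fg_b": "cpu1", "disp_a": "cpu1", "disp_b": "cpu2"}
print(collocated_replicas(groups, binding))
```

Here both flight guidance replicas are bound to the same processor, so the check flags that group, while the display replicas are deployed on separate processors and pass.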

We also need to assess the existing fault management patterns for their ability to provide
robustness [DeVale 2002] (i.e., resilience to unknown and unintended fault hazards [Sha 2009]).


3.2.4  Piloting  Architecture-Centric,  Model-Based  Engineering 

The  SAE  AADL  standard  has  30  voting  member  organizations  that  have  participated  in  and  con¬ 
tributed  to  the  development  of  the  AADL  standard  suite.  Many  of  them  use  the  technology  of  the 
aircraft,  space,  and  automotive  industry.  They  have  also  become  early  adopters  of  the  standard  as 
a technology. Some of the major industry initiatives are shown in Figure 23.

Figure 23: Industry Initiatives Using AADL

The first initiative was led by the European Space Agency (ESA). This initiative, called Automated
proof-based System and Software Engineering for Real-Time applications (ASSERT) [Conquet 2008],
focused on representing two families of satellite architecture in AADL, validating
them, and generating implementations from the validated architecture. In a follow-on initiative,
the ESA is providing an architecture-centric, model-based development process supported by a
tool chain, The ASSERT Set of Tools for Engineering (TASTE) [Perrotin 2010].

A second initiative was led by Airbus Industries to focus on the creation of an open source toolkit
for critical systems called TOPCASED [Heitz 2008]. This toolkit is based on Eclipse and provides
a meta-model-based environment for supporting model-based engineering of safety-critical systems.
It uses a model-bus concept to support model transformation between different model representations
in the model repository and during interface with analysis and generation tools. AADL
is one of the supported modeling notations, and OSATE has been integrated with TOPCASED.

Another European initiative is Support for Predictable Integration of mission Critical Embedded
Systems (SPICES) [SPICES 2006]. SPICES integrates the use of AADL with formalized requirement
specification, the Common Object Request Broker Architecture (CORBA) Component
Model (CCM), and SystemC into an engineering framework for formal analysis and generation of
implementations. Some examples of this work are the GORE-based requirements integration with
AADL [Delehaye 2009], an adaptation of an electrical power consumption analysis toolbox with
AADL [Senn 2008], and a prototype for behavioral verification of an AADL model with Behavior
Annex annotations using the Time Petri Net Analyzer (TINA) toolkit [Berthomieu 2010]. The
SPICES methodology and tools have been piloted on avionics, space, and telecommunication
applications.

Modeling  and  Analysis  of  Real-Time  Embedded  systems  (MARTE)  is  an  effort  by  the  Object 
Management  Group  (OMG)  to  provide  a  UML  profile  for  embedded  system  modeling  that  draws 
on  the  AADL  meta  model  and  its  semantics  and  includes  a  profile  subset  to  represent  AADL 
models  [OMG  MARTE  2009]. 

ARTIST2 is a primarily European network of excellence on embedded systems design that provides
a forum for researchers and industry practitioners to exchange ideas on the advancement of
embedded system design technology. It has cosponsored the international UML and AADL workshop
series and organized a number of conferences and other workshops at which AADL-related
work and other model-based technologies have been presented and discussed, including a workshop
on Integrated Modular Avionics (IMA).

The Open Group is an industry-focused organization that promotes the adoption of new technologies.
Its Real-Time Forum has examined AADL as a promising technology internationally.

The System Architecture Virtual Integration (SAVI) [Redman 2010] is an international aircraft
industry-wide, multi-year, multi-phase initiative (see Figure 24 and Figure 25) to mature and put
into practice an architecture-centric, model-based engineering approach based on industry-standard
model representation and interchange formats. Members include aircraft manufacturers
(Boeing, Airbus, Lockheed Martin), suppliers (BAE Systems, Rockwell Collins, GE Aviation),
government/certification agencies (FAA, NASA, DoD), and the SEI.

For more information on ARTIST2, see http://www.artist-embedded.org.
For more information on The Open Group, see http://www.opengroup.org.

The SAVI approach is to drive the development and validation/verification process through a
reference model in a standardized meta model, semantics, and interchange format in order to achieve
earlier discovery of system-level problems in the embedded software systems. After evaluating
several  architecture  modeling  technologies,  initiative  members  chose  AADL  as  a  key  technology. 
A  single  architectural  reference  model  expressed  in  AADL  represents  the  source  of  the  single 
truth for multiple dimensions of analysis, simulation, and generation.

Figure 24: SAVI Approach

SAVI takes a multi-notation approach to the content of the model repository that is based on a
semantically consistent meta model of its content (the reference model), thus minimizing model
overlap. At the architecture level, this is achieved by expanding the meta model for AADL, which
supports SysML® component modeling, to accommodate architectural aspects of computer hardware
expressed in VHDL and of mechanical system models expressed in notations such as Modelica®.
At the detailed design level, this modeling is complemented by detailed design notations,
such as physical system and control dynamics expressed in Simulink®, or application code
expressed in Ada, C/C++, or Java.

In Phase I, a proof-of-concept (POC) demonstration was conducted to get buy-in from management
of the member companies to fund the next phases of the initiative. The SAVI POC demonstrated
that the SAVI concepts of a model repository and model bus support an architecture-centric,
single-source representation. Auto-generation of analytical models interfacing with multiple
analysis tools from annotated architecture models preserves single-truth analysis results. This
was demonstrated through

•  multi-tier  modeling  and  analysis  across  system  levels 

•  coverage  of  system  engineering  and  embedded  software  system  analysis 

•  propagation  of  changes  across  multiple  analysis  dimensions 

•  distributed  team  development  via  a  repository  to  support  airframe  manufacturer  and  supplier 
interaction.

Figure 25: A Multi-Notation Approach to the SAVI Model Repository Content

Figure 26 shows some of the elements of this demonstration. At Tier 1, the focus was on the
aircraft system and included analysis of weight and power. At Tier 2, the integrated modular
avionics (IMA) portion of the aircraft architecture was expanded to represent the computer platform
and major embedded software subsystems in a partitioned architecture. At this time, the previous
analyses were repeated on the refined model. Additional analyses included a first-level computer
resource analysis and end-to-end latency analysis of two critical flows.

■ Multi-notation model repository
■ Multi-tier system and software architecture (in AADL)
■ Integrator and subcontractor virtual integration

Figure 26: SAVI Proof-of-Concept Demonstration

Next, the subcontracting negotiation phase between a system integrator and suppliers was
demonstrated by providing AADL models as part of a request for proposals (RFP) and as part of
submitted proposals. At that time, the system integrator virtually integrated the supplier models to ensure
consistency with the system architecture and between the subsystems; this was demonstrated by
the checking of functional integration and of data bus protocol mappings, in addition to revisiting
earlier analyses in order to ensure that the analysis results still met the requirements. Two of the
subcontractor subsystems were then elaborated into a task architecture and, in one case, populated
with detailed design, Ada code, and a generated runtime system from AADL to analyze best
allocation of computer resources including schedulability and actual testing of code. Finally, the
detailed subsystem models were delivered to the integrator and integrated into the system
architecture for system-level analysis, which revisits all previous analyses based on the refined models
and data from actual code replacing the initial resource estimates such as execution time.

In  2010,  the  demonstration  was  expanded  by 

•  refining the physical systems with mechanical models to better demonstrate the ability to
perform virtual integration and analysis across system engineering and software engineering
boundaries (also demonstrated in a paper by Bozzano and associates [Bozzano 2010]). In
particular, the demonstration showed how a single AADL model could represent the interaction
between a finite element model of the aircraft wing structure and a mechatronics actuator
model.

•  including a focus on safety and reliability requirements, supporting hazard, fault impact, and
reliability analysis. In particular, the demonstration illustrated the ability to perform
Functional Hazard Assessment (FHA), Failure Mode and Effect Analysis (FMEA), and Mean
Time To Failure (MTTF) analysis from one set of Error Model Annex annotations to the
AADL model of the aircraft, with a focus on applying these analyses to the embedded flight
guidance system.

•  applying static analysis such as model checking to validate the mode logic and other system
and software behavior specifications early and throughout the development life cycle with the
goal of demonstrating end-to-end system validation from requirements to implementation. In
particular, the redundancy logic of the flight guidance system was validated under nominal
conditions, as well as under the occurrence of several types of failures.

Architecture-centric modeling and analysis with AADL has also been applied to a reference
architecture for autonomous space vehicles [Feiler 2009c] and to the evaluation of potential migration
issues for an avionics system moving from a federated architecture to an IMA architecture [Feiler
2004]. Other examples include the use of model-based analysis with AADL in the context of an
architecture evaluation using the ATAM and the development and piloting of a virtual upgrade
validation method that incorporates techniques to address the root cause areas identified in
Section 2.4 [DeNiz 2012].


3.3 Static Analysis


Static analysis is any technique that, prior to deployment and execution, mathematically proves
system properties from the system description. As mentioned in Section 3.1, it is necessary to
apply static analysis to benefit from formalizing requirements, since it enables verification of their
completeness and consistency. Furthermore, unlike other validation methods such as testing and
simulation, static analysis techniques apply throughout the development cycle: for validation of
requirements, verification of design models against requirements, verification of implementation
against the design model, and verification of code against common problems such as deadlocks,
buffer overflows, and so on. Static analysis leads to early error detection and improves the
development process by providing an end-to-end validation framework [Miller 2010, Bozzano 2010].

Many  system  properties  are  amenable  to  static  analysis,  such  as  performance,  resource  allocation, 
and ability to diagnose. However, for safety-critical systems, the most prominent form of static
analysis  is  analysis  of  behavioral  properties  using  model  checking  and  abstract  interpretation. 
Static  analysis  can  establish  that  the  system  (or  its  model)  satisfies  its  functional  requirements, 
never  enters  its  unsafe  region,  never  produces  a  runtime  error,  and  never  deadlocks.  Case  studies 
show  that  model  checking  is  more  effective  than  testing  for  detecting  subtle  defects  in  the  design 
[Miller  2010]. 

Scalability  is  the  main  bottleneck  to  a  widespread  use  of  formal  methods-based  static  analysis  in 
development  of  safety-critical  systems.  This  is  not  surprising:  establishing  the  safety  of  a  system 
is a difficult problem. It is unlikely that static analysis techniques will ever scale fully to complex
systems and displace other validation techniques such as testing and simulation. However,
experience shows that scalability can be achieved by a combination of abstraction (i.e., extracting finite
models  of  the  system)  [Clarke  1994],  designing  for  verification  [Groundwater  1995,  Miller  2010], 
compositional  verification  [Clarke  1989,  Grumberg  1994],  and  by  developing  domain-specific 
analysis  tool  chains  [Groundwater  1995,  Miller  2010]. 

We  proceed  with  highlighting  some  of  the  state-of-the-art  approaches  to  static  analysis  of  discrete 
system  behavior  and  static  analysis  of  other  system  properties,  and  give  examples  of  end-to-end 
validation  systems  based  on  these  techniques. 

3.3.1  Static  Analysis  of  Discrete  System  Behavior 

We divide discrete systems into two types: finite state (systems with finite control and data
domains) and infinite state (systems with finite control but infinite data domain). Finite-state
systems can be expressed as models (or programs) using only Boolean and enumerated data types.
Infinite-state systems are models (or programs) that require unbounded (or continuous) state
values. We describe static analysis techniques for both types of systems in turn.

Model-checking  is  the  most  prominent  static  analysis  technique  for  finite-state  systems  [Clarke 
1999].  A  model  checker  takes  as  an  input  a  finite  description  of  the  system  and  a  property  (or  a 
requirement)  formalized  in  temporal  logic  and  decides  whether  the  system  satisfies  the  property. 
If the system does not satisfy the property, the model checker returns a counterexample, that is,
an execution that violates the property. The ability to produce counterexamples makes model
checkers effective debugging tools for both the system and the formalization of the requirement.

Figure 27: Rockwell Collins Translation Framework for Static Analysis
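
To make the notion of a counterexample concrete, the sketch below performs an explicit-state search for a violation of a simple safety property (mutual exclusion) and returns the offending execution. It is a toy illustration under our own assumptions: the two-process model and all names are hypothetical, and industrial model checkers use far more sophisticated algorithms and accept full temporal logic.

```python
from collections import deque


def check_safety(initial, successors, is_bad):
    """Breadth-first search of the reachable states.

    Returns None if no bad state is reachable; otherwise returns a shortest
    execution (list of states) from the initial state to a bad state.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_bad(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def make_successors(use_lock):
    """Two processes cycling idle -> want -> crit -> idle; a state is (pc0, pc1, lock)."""
    def successors(state):
        pcs, lock = state[:2], state[2]
        result = []
        for i in (0, 1):
            nxt = list(pcs)
            if pcs[i] == "idle":
                nxt[i] = "want"
                result.append((nxt[0], nxt[1], lock))
            elif pcs[i] == "want" and not (use_lock and lock):
                nxt[i] = "crit"
                result.append((nxt[0], nxt[1], True))
            elif pcs[i] == "crit":
                nxt[i] = "idle"
                result.append((nxt[0], nxt[1], False))
        return result
    return successors


def both_critical(state):
    return state[0] == "crit" and state[1] == "crit"


start = ("idle", "idle", False)
print(check_safety(start, make_successors(use_lock=True), both_critical))   # None
print(check_safety(start, make_successors(use_lock=False), both_critical))  # violating trace
```

With the lock, the search exhausts the state space without finding a violation. Without it, the search returns an execution that ends with both processes in the critical section, which is the kind of evidence a designer can use to debug either the model or the property.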

To be suitable for model checking, a model must describe the behavior of a system using only
Boolean and enumerated types. Many industrial systems have components that satisfy this criterion
or can be altered to satisfy it [Miller 2010]. In practice, the model is often generated from an
architectural or other description of the system. For example, Rockwell Collins and the University
of Minnesota have developed a translation framework to automatically construct models for model
checking from Simulink and Stateflow modeling languages (see Figure 27). A similar
approach, but based on AADL models, is taken by Noll and associates in the COMPASS project
[Bozzano 2010, COMPASS 2011].

Properties  must  be  formalized  in  temporal  logic. 

•  Linear  temporal  logic  (LTL)  is  used  to  specify  a  behavior  of  a  system’s  individual  execution. 
It  extends  propositional  logic  (that  can  express  a  single  state  of  a  system)  with  these  temporal 
modalities: 

G (Globally): true in every state of the execution
F (Future): true in some future state of the execution
U (Until): true from the current state until some condition (or event) is seen

For example, a requirement “every request is eventually acknowledged” is formalized in
LTL as

G (req implies F ack).

req  and  ack  are  Boolean  state  variables  that  indicate  the  generation  of  a  request  and  receipt 
of  an  acknowledgement,  respectively.  The  LTL  expression  states,  “it  is  always  the  case  that  a 
request  is  followed  by  an  acknowledgment  sometime  in  the  future.” 

•  Computation tree logic (CTL) is used to specify a requirement with respect to all system
behaviors at once. It extends the temporal modalities of LTL with universal (A, for all
paths) and existential (E, exists a path) path quantifiers. For example, the above
requirement “every request is eventually acknowledged” is formalized in CTL as

AG (req implies AF ack).

The CTL expression states that, on all paths, it is always the case that a request is followed
by an acknowledgment on every continuation. The existential variant

EG (req implies AF ack)

is weaker; it states only that “there exists a path on which a request is always
acknowledged.”

CMU/SEI-2012-SR-013  |  46
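
The path quantifiers can be made concrete with a small fixpoint-based evaluator over an explicit transition system. The sketch below is a minimal illustration rather than the algorithm of any particular tool: the four-state model and all function names are hypothetical, and every state is assumed to have at least one successor. It evaluates both the universal form AG (req implies AF ack) and the existential form EG (req implies AF ack).

```python
def pre_exists(trans, target):
    """States with at least one successor in `target`."""
    return {s for s, succ in trans.items() if succ & target}


def pre_all(trans, target):
    """States all of whose successors are in `target` (transition relation assumed total)."""
    return {s for s, succ in trans.items() if succ <= target}


def least_fixpoint(step, start):
    current = set(start)
    while True:
        nxt = current | step(current)
        if nxt == current:
            return current
        current = nxt


def ef(trans, target):   # some path eventually reaches target
    return least_fixpoint(lambda z: pre_exists(trans, z), target)


def af(trans, target):   # every path eventually reaches target
    return least_fixpoint(lambda z: pre_all(trans, z), target)


def ag(trans, phi):      # phi holds in every state of every path
    return set(trans) - ef(trans, set(trans) - phi)


def eg(trans, phi):      # phi holds in every state of some path
    current = set(phi)
    while True:
        nxt = current & pre_exists(trans, current)
        if nxt == current:
            return current
        current = nxt


def request_response(trans, req, ack):
    """Return the states satisfying AG(req -> AF ack) and EG(req -> AF ack)."""
    phi = (set(trans) - req) | af(trans, ack)
    return ag(trans, phi), eg(trans, phi)


# Hypothetical model: a request may be acknowledged or lost forever.
lossy = {"idle": {"idle", "req"}, "req": {"ack", "lost"}, "ack": {"idle"}, "lost": {"lost"}}
always, sometimes = request_response(lossy, req={"req"}, ack={"ack"})
print("idle" in always, "idle" in sometimes)
```

For this model the universal property fails in the initial state because a request can be lost, whereas the existential property holds because there is a path that never issues a request. The contrast shows why the choice of path quantifier matters when formalizing a requirement.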

It is often inconvenient to formalize the requirements directly in temporal logic. Several
alternatives are available. Whenever requirements are formalized using one of the formal frameworks
described in Section 3.1.1, they can be converted into a suitable temporal logic automatically.
Property patterns provide templates for many common requirement specifications [Dwyer 1999].
Safety properties can also be described by directly annotating the model with assertions or by
embedding monitors.

A variety of industry-grade model checkers are available that have different strengths and
weaknesses.

•  NuSMV [Cimatti 2002] is a symbolic model checker that is a new (Nu) implementation and
extension of Symbolic Model Verification (SMV), the first model checker based on binary
decision diagrams (BDDs). It was developed as a joint project between the Formal Methods
group at the Istituto Trentino di Cultura - Istituto per la Ricerca Scientifica e Tecnologica
(ITC-IRST), the Model Checking group at Carnegie Mellon University, the Mechanized
Reasoning Group at University of Genova, and the Mechanized Reasoning Group at University of
Trento. NuSMV has been designed to be an open architecture for model checking that can be
reliably used for the verification of industrial designs, as a core for custom verification tools,
and as a testbed for formal verification techniques. It can also be applied to other research
areas. NuSMV has been extensively used in verification of safety-critical systems. It is the main
reasoning engine in the work of Miller and associates and the COMPASS project [Miller
2010, Bozzano 2010, COMPASS 2011]. It has been successfully used to verify systems with
10^°°  states  [Miller  2010]. 

•  Simple Promela Interpreter (SPIN) is an explicit-state model checker for verification of
distributed systems [Holzmann 2003]. Its development began at Bell Labs in 1980 in the original
UNIX group of the Computing Sciences Research Center. SPIN provides a very rich modeling
language called Promela. The supported features of the language include dynamic creation of
concurrent processes and communication via synchronous and asynchronous messages. The
SPIN model checker has been used extensively in verification of telecommunication
protocols.

•  The C Bounded Model Checker (CBMC) is aimed at static analysis of embedded software
[Clarke 2004]. It supports models described in ANSI C and SystemC languages. It allows for
verification of buffer overflows, pointer safety, and user-specified assertions. It can also be
used to check ANSI C implementations for consistency with other languages such as Verilog.
Unlike NuSMV and SPIN, the CBMC is incomplete since it examines only bounded
executions of the system. However, it can be directly applied to the system’s source code.
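
The incompleteness of a bounded analysis can be illustrated with a small sketch. The fragment below explores executions only up to a given number of steps; it is our own illustration, it does not reproduce the way CBMC encodes programs, and the function name and example system are hypothetical.

```python
def bounded_check(initial, successors, is_bad, bound):
    """Search executions of at most `bound` steps.

    Returns the depth at which a bad state is first reached, or None if no
    bad state is found within the bound. None is NOT a proof of safety.
    """
    frontier = {initial}
    seen = {initial}
    for depth in range(bound + 1):
        if any(is_bad(state) for state in frontier):
            return depth
        # Advance one step, keeping only states not visited before.
        frontier = {nxt for state in frontier for nxt in successors(state)} - seen
        seen |= frontier
    return None


def count_up(n):
    """Hypothetical system: a 4-bit counter that wraps around."""
    return [(n + 1) % 16]


print(bounded_check(0, count_up, lambda n: n == 7, bound=5))   # None: defect lies beyond the bound
print(bounded_check(0, count_up, lambda n: n == 7, bound=10))  # 7: defect found at depth 7
```

The first call reports no violation only because the defect lies deeper than the bound, which is why a negative result from a bounded analysis has to be read with the bound in mind.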

Static analysis of discrete infinite-state systems is much harder than analysis of finite-state
systems. It is well known from the works of Church, Gödel, and Turing in the 1930s that verification
of such systems is undecidable. Thus, techniques for static analysis of infinite systems are
inherently incomplete. That is, they typically detect all possible errors but occasionally either indicate
an error that cannot really happen (a false alarm) or never terminate (in this case, the tool is
typically stopped when resources are exhausted but before the analysis has completed).

For more information on NuSMV, go to http://nusmv.fbk.eu/NuSMV/index.html.

ASTRÉE is a static program analyzer aimed at proving the absence of runtime errors (RTEs) in
C programs [Cousot 2005]. It can analyze structured C programs that have complex memory
usage but no dynamic allocation. It targets embedded programs in earth transportation, nuclear
energy, medical instrumentation, aeronautic, and aerospace applications. It has been used to
completely and automatically prove the absence of any RTE in the primary flight control software of
the Airbus A340 fly-by-wire system and to analyze the electric flight control codes for the A380
series.

Figure 28: CounterExample-Guided Abstraction Refinement Framework


CounterExample-Guided Abstraction Refinement (CEGAR) is a technique pioneered at Carnegie
Mellon University that extends finite-state model checkers to analyze infinite-state systems
[Clarke 2003]. The technique uses an automated theorem prover (or a Satisfiability Modulo
Theories (SMT) solver) to automatically extract a finite-state abstraction of an infinite-state system
(see Figure 28). The behaviors of the abstraction are a superset of the concrete behaviors. Thus, if
the abstraction is shown to be safe by a model checker, the concrete system is safe as well. If the
abstraction is unsafe, the counterexample generated by the model checker is used to either
construct an unsafe execution of the concrete system or to automatically refine the abstraction. Many
academic tools are available in this active area of research.
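
The abstract-check-refine loop can be sketched on a toy example. The fragment below is a strong simplification under our own assumptions: the concrete system is a small finite set of integers standing in for an infinite-state system, the abstraction is a partition of the states into blocks, and refinement splits a block by direct enumeration instead of calling a theorem prover. All names are hypothetical, and `succ` is assumed to map states back into the given state set.

```python
from collections import deque


def abstract_path(partition, block_of, succ, init, is_bad):
    """Shortest path of blocks from the initial block to a block of bad states."""
    abs_succ = {b: {block_of[t] for s in b for t in succ(s)} for b in partition}
    start = block_of[init]
    parent = {start: None}
    queue = deque([start])
    while queue:
        block = queue.popleft()
        if any(is_bad(s) for s in block):
            path = []
            while block is not None:
                path.append(block)
                block = parent[block]
            return path[::-1]
        for nxt in abs_succ[block]:
            if nxt not in parent:
                parent[nxt] = block
                queue.append(nxt)
    return None


def cegar(states, init, succ, is_bad):
    """Return ("safe", final partition) or ("unsafe", concrete trace)."""
    bad = frozenset(s for s in states if is_bad(s))
    partition = [b for b in (bad, frozenset(states) - bad) if b]
    while True:
        block_of = {s: b for b in partition for s in b}
        path = abstract_path(partition, block_of, succ, init, is_bad)
        if path is None:
            return "safe", partition            # abstraction safe => concrete system safe
        traces = {init: [init]}                 # concrete states reached along the path
        for i in range(len(path) - 1):
            step = {}
            for s, trace in traces.items():
                for t in succ(s):
                    if t in path[i + 1] and t not in step:
                        step[t] = trace + [t]
            if not step:
                # Spurious counterexample: split the block where the path breaks.
                can_step = frozenset(
                    s for s in path[i] if any(t in path[i + 1] for t in succ(s)))
                partition.remove(path[i])
                partition += [can_step, path[i] - can_step]
                break
            traces = step
        else:
            return "unsafe", next(iter(traces.values()))


def step_by_two(s):
    return [(s + 2) % 10]


print(cegar(range(10), 0, step_by_two, lambda s: s == 5)[0])  # safe
print(cegar(range(10), 0, step_by_two, lambda s: s == 6))     # ('unsafe', [0, 2, 4, 6])
```

In the first call the initial two-block abstraction admits a spurious path to the bad state; successive refinements split off the odd states until the abstraction proves safety without enumerating the even states individually. In the second call the abstract counterexample can be replayed concretely, so a real violating execution is returned.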


3.3.2  Static  Analysis  of  Other  System  Properties 

Static analysis is not limited to analyzing discrete system behavior. There are static analysis
techniques for scheduling, resource allocation, and real-time, probabilistic, and hybrid control
properties. Here, we highlight tools for verification of real-time and probabilistic properties.

In traditional model-checking, temporal properties are defined with respect to the temporal order
of events. However, the actual passage of time between events is ignored. For example, it is
possible to check whether a request is acknowledged but not whether a request is acknowledged
within a given time bound, say 10 ms. When real time is important, the system must be modeled with
real-valued clocks.

Static analysis of such real-time systems can be done with the UPPAAL tool that supports modeling
systems as a nondeterministic composition of timed automata [UPPAAL 2009]. It can then
simulate  and  model-check  properties  of  such  systems.  It  uses  symbolic  techniques  to  reduce  the 
state  space  exploration  problem.  It  has  been  used  in  a  variety  of  industrial  case  studies,  including 
verification  of  an  automobile  gearbox  controller. 
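
To make the difference between an untimed and a time-bounded property concrete, the following sketch checks the bounded-response property ("every request is acknowledged within 10 ms") on a single timestamped trace. This is only a trace check, not model checking: a tool such as UPPAAL explores all behaviors of a timed-automata model, whereas this code looks at one recorded run. The event names "req" and "ack", the trace encoding, and the function name are assumptions made for this illustration.

```python
def bounded_response(trace, bound_ms):
    """Check that every "req" is followed by an "ack" within bound_ms.

    trace: list of (timestamp_ms, event) pairs in increasing time order.
    An "ack" acknowledges all requests that are pending at that time.
    Requests still pending at the end of the trace count as violations.
    """
    pending = []  # timestamps of requests not yet acknowledged
    for time_ms, event in trace:
        if event == "req":
            pending.append(time_ms)
        elif event == "ack":
            if any(time_ms - requested > bound_ms for requested in pending):
                return False
            pending.clear()
    return not pending


# Both traces satisfy the untimed property "a request is eventually
# acknowledged", but only the first satisfies the 10 ms bound.
on_time = [(0, "req"), (7, "ack")]
too_late = [(0, "req"), (12, "ack")]
print(bounded_response(on_time, 10), bounded_response(too_late, 10))
```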

For some properties, it is important to consider the inherently probabilistic nature of real systems.
For example, it might be necessary to know the expected time for a request to be acknowledged or
the expected power consumption. These questions can be answered by a probabilistic model
checker, such as PRISM [Kwiatkowska 2010] or the Markov Reward Model Checker (MRMC)
[Katoen 2009]. In this case, the system is modeled as a Markov model, and properties are expressed
in temporal logic extended with probabilistic modalities.
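
As a small illustration of the kind of question such tools answer, the sketch below computes the expected number of steps until a request is acknowledged in a discrete-time Markov chain by solving the standard equations E[s] = 1 + sum over t of P(s, t) E[t], with E[target] = 0. It does not use PRISM or MRMC; the three-state chain, the transition probabilities, and the function name are invented for the example, and the code assumes the target state is reached with probability 1 from every state.

```python
from fractions import Fraction


def expected_steps(P, target):
    """Expected number of steps to reach target from each state of a DTMC.

    P: dict state -> dict successor -> probability (each row sums to 1).
    Solves (I - Q) x = 1 exactly by Gauss-Jordan elimination, where Q is
    the transition matrix restricted to the non-target states.
    """
    states = [s for s in P if s != target]
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for s in states:
        i = index[s]
        rows[i][i] += 1
        rows[i][n] = Fraction(1)  # one step is taken in every non-target state
        for t, p in P[s].items():
            if t != target:
                rows[i][index[t]] -= Fraction(p)
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [v / scale for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    result = {s: rows[index[s]][n] for s in states}
    result[target] = Fraction(0)
    return result


# Hypothetical chain: a sent request is acknowledged with probability 9/10;
# otherwise it is lost and resent in the next step.
chain = {
    "sent": {"acked": Fraction(9, 10), "lost": Fraction(1, 10)},
    "lost": {"sent": Fraction(1)},
    "acked": {"acked": Fraction(1)},
}
print(expected_steps(chain, "acked")["sent"])  # 11/9 steps
```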

3.3.3  End-to-End  Validation 

Static analysis enables a complete end-to-end validation framework. Formalized requirements can
be validated for completeness (all system behaviors are considered) and consistency (no requirement
conflicts with another) and explored for possibility (which potential behaviors are compatible
with requirements and which are not). This process can find flaws in the requirements before a
detailed design is constructed. This, in turn, avoids the costs of fixing the requirements later during
the design cycle and leads to faster qualification. Formalized detailed designs can be validated
against the requirements. This process can find flaws in the designs (or, potentially, in the requirements)
before implementation is completed. This avoids the cost of fixing the design and
requirements during the implementation phase of the development cycle. Finally, the implementation
can be validated against the detailed design. This facilitates more efficient discovery of implementation
errors than testing and simulation alone [Miller 2010]. The overall framework promotes
modularity, which is paramount for scalability of static analysis techniques.

Figure 29: The Architecture of the COMPASS Toolset

The COMPASS project, funded by the European Space Agency (ESA), is a good example of a
toolset supporting end-to-end validation through static analysis (see Figure 29) [COMPASS
2011]. The framework builds on AADL and the AADL Error Model Annex. It provides tool
support for the following analyses:

• formalizing requirements as specification patterns and parameterized templates. This facilitates
capturing safety, correctness, performance, and reliability requirements.

• requirement validation through the Requirements Analysis Tool (RAT) [FBK 2009]. The requirements
are converted automatically to a suitable temporal logic. The included analyses
are consistency (checking that requirements do not contradict one another) and assertion
checking (checking whether requirements satisfy a given assertion).

•  functional  validation  through  model  checking  using  the  NuSMV2  tool  [Cimatti  2002].  This 
includes  checking  whether  the  design  satisfies  its  requirements  under  nominal  conditions  and 
checking  for  the  absence  of  deadlocks. 

•  safety  analysis  with  traditional  techniques  including  FTA  and  FMEA 

• performance evaluation to compute system performance under degraded operations based on
probabilistic inference using Markov Reward Model Checker (MRMC)

• Fault Detection, Isolation, and Recovery (FDIR) analyses to verify that the system can
properly detect, isolate, and recover from a given fault.

3.4 Confidence in Qualification Through System Assurance

We define assurance to be justified confidence that a system will function as intended in its environment
of use. When we dissect this seemingly simple definition into its component parts, we
find that it is actually quite complex. Confidence is justified only if there is believable evidence
supporting that confidence. A system functions as intended only if it does what its ultimate users
intend for it to do as they are actually using it, even though usage patterns will differ among individual
users. It also functions as intended only if it properly mitigates the possible causes (intentional
or unintentional) of critical failures to minimize their impact. Finally, a system's environment
of use includes the actual environment of use, not just the intended environment of use. A
shutdown system might work perfectly at sea level but totally fail 5,000 feet under the surface
where it is actually being used.

How do we achieve this justified confidence? Historically we have relied on the development
process and extensive testing. However, a recent study by the National Research Council (NRC)
titled “Software for Dependable Systems: Sufficient Evidence?” says that testing and good development
processes, while indispensable, are not by themselves enough to ensure high levels of dependability
[Jackson 2007]:

... it is important to realize that testing alone is rarely sufficient to establish high levels of
dependability. It is erroneous to believe that a rigorous development process, in which testing
and code review are the only verification techniques used, justifies claims of extraordinarily
high levels of dependability. Some certification schemes, for example, associate higher
safety integrity levels with more burdensome process prescriptions and imply that
following the processes recommended for the highest integrity levels will ensure that the
failure rate is minuscule. In the absence of a carefully constructed dependability case, such
confidence is misplaced.

Such  a  dependability  case  augments  testing  when  testing  alone  is  infeasible  or  too  costly.  The 
case  establishes  a  relationship  between  the  tests  (and  other  evidence)  and  the  properties  claimed. 

What the NRC report calls a dependability case, the community at large is calling an assurance
case. Under either name, it is somewhat similar to a legal case. In a legal case, there are two basic
elements. The first is evidence, such as witnesses, fingerprints, or DNA. The second is an argument
given by the attorneys as to why the jury should believe that the evidence supports (or does
not support) the claim that the defendant is guilty (or innocent). A jury presented with only an
argument that the defendant is guilty, with no evidence that supported that argument, would certainly
have reasonable doubts about the defendant’s guilt. A jury presented with evidence without
an argument explaining why the evidence was relevant would have difficulty deciding how the
evidence relates to the defendant.

The goal-structured assurance case is similar. Affirming evidence is associated with a property of
interest (e.g., safety), attesting that it fulfills its claim. For instance, test results might be collected
into a report. Without an argument as to why the test results support the claim of safety, an interested
party could have difficulty seeing their relevance or sufficiency. With only a detailed argument
that depends on test results to show that a system is safe, but does not provide those results,
it would again be hard to establish the system’s safety. So a goal-structured assurance case
as shown in Figure 30 specifies a claim (or goal) regarding a property of interest, evidence that
supports that claim, and a detailed argument explaining how the evidence supports the claim.


Figure  30:  A  Goal-Structured  Assurance  Case 
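
The claim, argument, and evidence structure can be captured in a few lines of code. The following is a minimal sketch, not the notation of [Kelly 2004] or of any assurance case tool: the Claim class, the gaps function, and the example claim texts are assumptions made for this illustration. It flags the two weaknesses described above, namely a claim with no supporting evidence and supporting material with no argument tying it to the claim.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """A node in a goal-structured assurance case."""
    text: str
    argument: str = ""  # why the support below justifies the claim
    evidence: List[str] = field(default_factory=list)
    subclaims: List["Claim"] = field(default_factory=list)


def gaps(claim):
    """List claims lacking evidence or lacking an argument, depth first."""
    found = []
    if not (claim.evidence or claim.subclaims):
        found.append(f"no evidence: {claim.text}")
    elif not claim.argument:
        found.append(f"no argument: {claim.text}")
    for sub in claim.subclaims:
        found.extend(gaps(sub))
    return found


# Hypothetical case: one subclaim has a test report but no argument,
# the other has an argument but no evidence.
case = Claim(
    "The system is safe",
    argument="Each identified hazard is mitigated",
    subclaims=[
        Claim("Battery exhaustion is mitigated",
              evidence=["alarm timing test report"]),
        Claim("The alarm is noticeable",
              argument="Alarm loudness exceeds ambient noise"),
    ],
)
for gap in gaps(case):
    print(gap)
```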


The goal-structured assurance case [Kelly 2004] has been used extensively outside of the United
States for a number of years to assure the safety of nuclear reactors, railroad signaling systems,
avionics systems, and so forth. Assurance cases are now being developed for other attributes (e.g.,
security of a software supply chain [Ellison 2008]) and other activities (e.g., acquisition
[Blanchette 2009]).


As evidenced by the NRC report, there is increasing interest in assurance cases in the United
States. An International Organization for Standardization (ISO) standard for assurance cases
(ISO/IEC 15026-2) is now under development. The U.S. Food and Drug Administration (FDA)
recently began to suggest their inclusion in regulatory submissions [FDA 2010].

In best practice, an engineering organization will develop an assurance case in parallel with
the system it assures. The case’s structure will be used to influence assurance-centered actions
throughout the life cycle. The co-development of the assurance case has several important advantages:

•  It  can  help  determine  which  claims  are  most  critical  and,  hence,  what  evidence  and  assurance- 
related  activities  are  most  needed  to  support  such  claims. 

• It can help guide design decisions that will simplify the case and, thus, make it easier to develop
a convincing argument that important properties have been met.

• It serves as documentation as to why certain design decisions have been made, making it easier
to revisit these decisions should the need arise, helping to communicate engineering expertise,
and allowing for more efficient reuse in subsequent systems.

• It can help management make a more accurate determination of whether the development is
on track to produce a system that meets its requirements.

Note: When used to show safety, an assurance case is called a safety case.

Whether co-developed or not, the resulting product is useful for supporting qualification (and requalification)
decisions, managing resources and activities (by showing which activities have the
most payoff for claims of particular importance), and estimating the impact of design and requirements
changes (by showing which portions of the case may be affected).

3.4.1  Requirements  and  Claims 

There are basically two approaches for structuring an assurance case: (1) focus on identifying requirements,
showing that they are satisfied, or (2) focus on hazards to fulfilling those requirements,
showing that the hazards have been eliminated or adequately mitigated. The approaches
are not mutually exclusive (to show that a requirement is met, one often has to show that hazards
defeating the requirement have been eliminated or mitigated), but each has a different flavor.
Each type has a role to play in developing an assurance case.

Because developers are used to stating nonfunctional requirements (e.g., safety, availability, performance)
and then ensuring that they are satisfied, top-level claims in an assurance case often
have a requirements flavor (e.g., “X is safe,” which might be decomposed into subclaims that
“X is electrically safe,” “X is safe to operate,” etc.). Typically, these nonfunctional requirements
arise from an understanding of hazards that need to be addressed; each such requirement, if satisfied,
mitigates one or more hazards. But if the case only addresses derived requirements whose
satisfaction implies safety (e.g., “timeout is 5 seconds”), the link to the hazards mitigated by the
requirement can be lost; it can become difficult to decide if the requirement is adequate to address
the underlying hazard(s).

To see how a focus on requirements can obscure underlying hazards, let’s consider an example.
Suppose we have a safety-critical system that must monitor its operational capability and can either
run connected to an electrical outlet or using a battery. An obvious hazard is loss of battery
power; one might therefore state a safety requirement aimed at helping to ensure that the system is
plugged into an electrical power source prior to battery exhaustion. Such a requirement might be
worded as follows:

When  operating  on  battery  power,  visual  and  auditory  alarms  are  launched  at  least  10 
minutes  prior  to  battery  exhaustion  but  no  more  than  15  minutes  prior. 

To  demonstrate  that  this  claim  holds  for  a  particular  system,  we  could  provide  test  results  showing 
that  warnings  are  raised  at  least  10  minutes  prior  to  battery  exhaustion  but  no  more  than  15 
minutes  prior.  In  addition,  we  could  present  arguments  showing  that  we  have  confidence  in  such 
test  results  because  the  structure  of  the  tests  has  taken  into  account  various  potential  causes  of 
misleading  results.  For  example,  since  the  battery  discharge  profile  changes  depending  on  the  age 
of  a  battery,  we  would  need  to  show  that  all  the  tests  were  run  with  a  mixture  of  new  and  well- 
used  batteries.  Similarly,  since  the  electrical  load  might  affect  the  time  to  battery  exhaustion,  we 
would  need  to  show  that  the  tests  were  run  with  different  electrical  loads  on  the  system. 
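
A hedged sketch of how such test evidence might be checked mechanically follows. The record fields, the run data, and the function names are invented for this illustration; only the 10 to 15 minute window and the need to vary battery age and electrical load come from the text above.

```python
def alarm_lead_ok(alarm_min, exhaustion_min, min_lead=10.0, max_lead=15.0):
    """True if the alarm was raised between min_lead and max_lead minutes
    before battery exhaustion (times are minutes since the test started)."""
    lead = exhaustion_min - alarm_min
    return min_lead <= lead <= max_lead


def failing_runs(runs):
    """Identifiers of test runs that violate the alarm timing requirement."""
    return [r["id"] for r in runs
            if not alarm_lead_ok(r["alarm_min"], r["exhaustion_min"])]


def covers_conditions(runs):
    """True if the runs include new and well-used batteries and more than
    one electrical load, the two pitfalls to valid testing named above."""
    ages = {r["battery"] for r in runs}
    loads = {r["load_w"] for r in runs}
    return {"new", "well-used"} <= ages and len(loads) > 1


# Hypothetical test records.
runs = [
    {"id": "T1", "battery": "new", "load_w": 20, "alarm_min": 108.0, "exhaustion_min": 120.0},
    {"id": "T2", "battery": "well-used", "load_w": 20, "alarm_min": 62.0, "exhaustion_min": 70.0},
    {"id": "T3", "battery": "well-used", "load_w": 35, "alarm_min": 31.0, "exhaustion_min": 44.5},
]
print(failing_runs(runs), covers_conditions(runs))
```

Here run T2 fails because the alarm leaves only 8 minutes of warning. A check like this produces evidence; the argument for why that evidence is sufficient still has to be made in the assurance case.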

We can represent such a safety requirement as a claim in an assurance case together with the evidence
and other arguments needed to show that the requirement is satisfied (see Figure 31). One
set of evidence is the aforementioned testing results. However, we can increase confidence in the
validity of the claim by also showing that pitfalls to valid testing have been adequately mitigated.
Going a step further, our confidence in the validity of the claim would also be increased by an
argument asserting the accuracy of the algorithm used to estimate remaining battery life. The
combination of testing results and algorithm analysis makes the case stronger than if just test results
alone were presented. To support the algorithm-accuracy claim, a developer might reference
design studies and literature, as well as an analysis of the actual design.


Figure  31:  Confirming  That  a  Safety  Requirement  Has  Been  Satisfied 

Such tests and analysis are fine for demonstrating that the requirement is satisfied. But from a
safety viewpoint, we have little documentation about what hazard the requirement is mitigating.
In addition, how do we know that 10 minutes is the appropriate warning interval for every setting?
Is 10 minutes enough time for someone to respond to the alarm? Will the alarm be heard in every
possible setting? How accurate does the measure of remaining power need to be (e.g., is it unacceptable
if the alarms are launched when 20 minutes of power remains)? How does this requirement
fit with other safety requirements? In short, to fully understand and validate the requirement,
we need to establish the larger context within which the requirement exists.

Figure 32: Context for Raising an Alarm About Impending Battery Exhaustion

Figure 32 presents

•  claims  related  to  a  battery-exhaustion  warning  system  and 

•  the  context  for  such  claims 

Directly below the first statement “The system is safe” we have eliminated, for simplicity’s sake,
reference to other safety hazards requiring mitigation that would normally appear beneath such a
claim. We also state that any hazard of system shutdown due to battery exhaustion has been mitigated.
With these matters settled, we proceed to the timing considerations that surround raising an
alarm that warns of impending battery exhaustion.

The proposed mitigation for battery exhaustion is to notify a system maintainer in a timely manner
that the battery is about to become exhausted. This is shown in the case by making the claim
of notifying the maintainers “sufficiently soon” but not “too soon.” We are now in a position to
state the safety requirement about when an alarm needs to be raised. In addition, we can now
readily deal with other hazards not addressed directly by the safety requirement; namely, we can
consider whether the warning time is sufficient to allow human response and also whether the
alarm is sufficiently noticeable that the human will be unlikely to ignore it.


Note: The case supporting the “Alarm noticeability” claim could be fairly complex, since the total variety of alarms and
indicators needs to be considered, as well as the fact that some alarms are more important than others. One
could ask: Is the system safer if the auditory alarm is louder or more urgent when the remaining battery life is
less than five minutes? Less than three? And what happens when there are competing alarms? Which one gets
the highest alarm signal? Is the overall alarming strategy for the system consistent with user interface standards
for alarm signaling? Will the alarm for an important condition be loud enough to be heard over competing
alarms or the sounds of other equipment? All these issues can be raised and dealt with in the expansion of the
“Alarm noticeability” claim.

Taken altogether, the exhaustion mitigation and notification claims establish the context and validity
of what was originally an isolated safety requirement. Whether all this argumentation is
necessary depends on the importance of dealing with battery exhaustion and the extent to which
there is a standard way of dealing with it. Less argument (and evidence) is needed to support mitigations
of less important hazards or where there is consensus on adequate ways of addressing a
particular hazard.

A benefit of focusing on safety requirements is that stating the safety requirements and demonstrating
that they have been met seems straightforward from a user viewpoint. But a safety assurance
case that only addresses whether safety requirements are met will focus primarily on what
tests and test conditions or other analyses are considered sufficient to show the requirements are
met. The case is likely to be less convincing when it does not deal explicitly with all relevant hazards
(i.e., when the reasoning leading from the hazards to the requirement is not part of the case).

Another problem with a pure requirements-based approach is that it can be difficult to specify
fault-tolerant behavior. For example, consider a high-level requirement such as “The system does
X.” Satisfying this requirement would certainly seem to satisfy a higher level claim that the system
is safe. But the requirement, as stated, implies that the system always does X, and there may
be factors outside the system’s control that can prevent this from happening. From a safety viewpoint,
we want to ensure the system minimizes the chances of becoming unsafe. Stating a claim
that is unachievable in the real world doesn’t allow the case to adequately address safety hazards
and their mitigations.

From a safety argument perspective, instead of focusing on safety requirements, per se, it is more
convincing to state (and satisfy) hazard mitigation claims. For example, a claim such as “The possibility
of not doing X has been mitigated” allows the assurance case to discuss the possible hazards
to doing X and then to explain the mitigation approaches, which can include raising alarms to
cause a human intervention.

3.4.2 Assurance Over the Life Cycle

Evidence-based arguments start with a single claim and then build an argument out of subclaims
at multiple levels. Eventually the lowest level claims are supported by evidence, and the end result
is that the high-level claim is substantiated. The nature of the argument and the nature of the evidence
will necessarily change as development moves through the different stages of the life cycle.
At early stages, an argument consisting of broad strokes supported by informal “hand waving”
evidence will allow design decisions to be made and development to continue. As the development
continues, the arguments needed to allow continued development become significantly more
detailed, and the supporting evidence becomes much more precise.

As an example of the above, consider a system that has a requirement to restart within one minute
of a system failure. Early in the life cycle, designers are faced with making decisions on how to
best meet this requirement. Obvious choices include hot standby, warm restart, and cold restart.
Each has its costs and benefits, and tradeoffs must be made.


Figure  33:  An  Assurance  Case  Early  in  Design 

Figure 33 shows a partial assurance case for such a design. Only the case for cold restart has been
expanded because that alternative appears, at this early stage, likely to accomplish
the goal of restarting within one minute. If there were any question about this goal being met, the
other alternatives would have also been expanded to enable a more informed decision. Figure 34
shows an assurance case for the same system later in the life cycle. It is both simpler (the rejected
alternatives have been removed from the case) and more complex (the analysis of the cold restart
alternative is supported by additional evidence) than the case in Figure 33. As the system is developed
further, the case is augmented with actual test results as evidence, as well as more details,
such as simulations, AADL models, and so on.

We expect to see much more detailed evidence as the development of the system continues along
the V that we discussed in Section 1.1. Thus it is important to develop and maintain the assurance
case in parallel with the system being assured. This has a multitude of benefits, including

•  An  assurance  case  fully  documents  the  system  being  developed,  leading  to  more  confidence 
in  the  quality  of  the  system  and  making  it  more  likely  that  the  design  will  be  understood  as 
new  personnel  are  brought  onboard. 

• An assurance case developed in parallel with development of the system can lead to more
insight into system quality earlier in the development cycle and can enable less expensive corrective
action if problems begin to surface.

• The opportunities for reuse of a design documented with an assurance case are significantly
greater than for one developed without it. This is especially true if assurance case patterns are
used. An assurance case pattern is a template that captures acceptable ways of structuring generic
arguments [Kelly 1998].

4 A Metric Framework for Cost-Effective Reliability Validation and Improvement


The objective of this metric framework is to drive the cost-effectiveness of a validation and reliability
improvement strategy for system qualification. We address this objective from two perspectives:

1. by focusing on high-risk areas that introduce faults into a system, we can see those areas in
the system that require more attention to reduce the introduction of faults

2.  by  accounting  for  the  effectiveness  of  validation  methods  at  different  times  in  the  life  cycle, 
we  can  understand  the  effectiveness  of  methods  to  discover  and  remove  faults  throughout  the 
development  life  cycle. 

This allows us to reduce development and qualification cost by avoiding rework and retest
through early discovery of faults. It also allows us to focus the validation resources on those parts
of the system and validation methods that most cost-effectively minimize residual system faults
with acceptable risk. We proceed by considering

•  architectural  metrics  that  reflect  the  potential  of  system-level  faults 

•  qualification-evidence  metrics  based  on  assurance  cases  that  reflect  sufficient  evidence  and 
acceptable  risk  for  the  absence  of  faults  in  the  qualification  of  a  system 

•  metrics  that  reflect  cost  and  cost  savings  in  system  development  and  validation. 

4.1  Architecture-Centric  Coverage  Metrics 

Traditional reliability engineering has focused on fault density and reliability growth as key metrics.
These are statistical process metrics that reflect the presence of faults by tracking discovery
of faults during review and testing, as well as failures during operation. This has worked well for
slowly evolving physical systems where the emphasis is on failure of physical components rather
than faults in the design. Such metrics have had limited success with software systems because
software faults are design faults. Furthermore, as software is frequently changed and evolved,
there is additional design fault potential.

Review and testing have been the primary approaches for addressing faults. Sequential code faults are
activated by certain execution paths operating on given data, which can be addressed by test coverage.
For systems with concurrent processing and sharing of resources, the combinations of possible
interactions grow exponentially, and faults such as race conditions are difficult to test for.
Especially for embedded software systems, the operational environment affects the behavior of
the system. A change in operational context may cause the system to behave in a way that activates
a previously latent fault.

We propose three metrics that aid in addressing faults introduced during the system design and do
so early in the life cycle: (1) one focusing on requirements coverage, (2) one focusing on safety
hazard coverage, and (3) one focusing on architecture-level system interaction coverage.

4.1.1 A Requirements Coverage Metric


Requirement specifications are a key artifact in the development process, since systems are validated
against their requirements. As we have seen in Sections 2.2 and 2.3, requirement errors are
major contributors to system-level problems that are currently discovered late in the development
process. Requirement specifications are often incomplete/ambiguous and incorrect/inconsistent.
High-level requirements, which may be difficult to validate in themselves, are refined into concrete
requirements for which qualification evidence can be provided in the form of analysis, simulation,
or testing. Requirements for a system must be specified with respect to its external interactions.
Furthermore, requirements at the system level must be decomposed into requirements
placed on system components. This decomposition must be done across the system architecture to
include requirement specifications beyond functional requirements on the embedded software
system.

We  define  a  requirements  coverage  metric  with  several  contributors.  We  base  this  definition  on 
the  assumption  that  requirements  are  associated  with  elements  of  a  system  architecture — in  other 
words,  that  requirements  can  be  traced  to  specific  system  components. 

The first contributor reflects the coverage of all interaction points of a system or system component
with its context. We can measure coverage of the requirement specification with respect to
its input, output, resource demands, and control, as well as its state and behavior (see Figure 12 in
Section 3.1.2). For input/output interactions, the requirement specification must address domain
data type, expected value range, measurement units, data stream characteristics (such as rate, latency,
ordering, and value changes), and implied resource demand. For control interactions, the
requirement specification must address action requests and responses (including expected ordering
of actions). The system (component) behavior must be characterized in terms of discrete states
(a set of behavioral states and transitions between states) as well as continuous value state spaces
(often expressed as equations). A requirement specification must address resource demands in
terms of types of resources; demand amount, such as worst-case execution time; rate of demand,
such as the period of a task; and time frame in which the resource must be available, such as a deadline.
The requirement specification must not only address nominal mission operation, but also
safety-criticality aspects, such as the ability to address safety hazards.

A  second  contributor  to  the  requirements  coverage  metric  is  the  degree  of  decomposition  of  re¬ 
quirements  into  requirements  on  system  components.  This  contributor  (1)  tracks  the  degree  to 
which  requirements  at  one  level  of  the  system  hierarchy  are  addressed  by  requirements  of  its 
components  and  (2)  reflects  whether  satisfaction  of  component  requirements  is  a  necessary  or 
sufficient  condition  to  satisfy  the  system  requirement. 
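
As an illustration only, the second contributor could be computed over a requirement hierarchy as in the following Python sketch. The `Requirement` class, its fields, and the two ratios it reports are assumptions made for this example; the report does not prescribe a data model.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """A requirement attached to one architecture element (hypothetical model)."""
    name: str
    # Requirements on subcomponents that are claimed to address this one.
    children: list[Requirement] = field(default_factory=list)
    # True when satisfying all children is argued to be sufficient, not merely necessary.
    children_sufficient: bool = False
    # True when the requirement sits on a leaf component and needs no decomposition.
    is_leaf: bool = False


def decomposition_coverage(root: Requirement) -> tuple[float, float]:
    """Return (fraction decomposed, fraction with a sufficiency argument)
    over all non-leaf requirements in the hierarchy."""
    total = decomposed = sufficient = 0
    stack = [root]
    while stack:
        req = stack.pop()
        stack.extend(req.children)
        if req.is_leaf:
            continue
        total += 1
        if req.children:
            decomposed += 1
            if req.children_sufficient:
                sufficient += 1
    if total == 0:
        return 1.0, 1.0
    return decomposed / total, sufficient / total


# Example: the system requirement is decomposed, but one subsystem requirement is not.
subsystem = Requirement("subsystem latency budget")
sensor = Requirement("sensor sample period", is_leaf=True)
system = Requirement("end-to-end latency", children=[subsystem, sensor])
print(decomposition_coverage(system))
```

Reporting the two ratios separately keeps part (1), the degree of decomposition, apart from part (2), whether the decomposition is argued to be sufficient.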

A  third  contributor  is  the  consistency  of  the  requirement  specification.  This  includes  consistency 
and  correctness  between  elements  of  a  requirement  specification  of  a  system  or  component,  as 
well  as  between  the  requirement  specification  of  system  components  and  that  of  the  containing 
system.  In  other  words,  this  contributor  reflects  the  set  of  constraints  that  requirement  specifica¬ 
tions  satisfy.  Examples  of  such  constraints  are  (1)  the  reachability  of  states  in  a  behavioral  state 
description,  (2)  the  consistency  between  input/output  requirement  specification  of  the  system  and 
those  of  its  components,  (3)  the  relationships  among  the  processing  rate  of  a  task,  the  intended 
sampling rate of input, and the arrival rate of data streams, (4) the relationship between resource
budgets of components and those of the system, and (5) the traceability of hazards in an FHA to
failure modes in an FMEA.
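
Two of these constraints lend themselves to simple mechanical checks. The sketch below is a minimal illustration of constraints (1) and (4); the function names, the dictionary encoding of a state machine, and the sample modes and budgets are all hypothetical.

```python
from collections import deque


def unreachable_states(initial, transitions):
    """Constraint (1): return the states that cannot be reached from `initial`.

    `transitions` maps each state to the set of states it can move to."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    all_states = set(transitions) | {s for targets in transitions.values() for s in targets}
    return all_states - seen


def budget_overrun(system_budget, component_budgets):
    """Constraint (4): amount by which the component budgets exceed the
    system budget (0.0 when the budgets are consistent)."""
    return max(0.0, sum(component_budgets.values()) - system_budget)


# Hypothetical behavioral specification: "maintenance" has no incoming transition.
modes = {
    "init": {"nominal"},
    "nominal": {"degraded"},
    "degraded": {"nominal"},
    "maintenance": {"init"},
}
print(unreachable_states("init", modes))
# Hypothetical processor-time budgets in milliseconds per frame.
print(budget_overrun(100.0, {"nav": 40.0, "display": 35.0, "comms": 30.0}))
```

A non-empty set of unreachable states or a positive overrun would count as a violated constraint for this contributor.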

4.1.2  A  Safety  Hazard  Coverage  Metric 

The  part  of  the  system  addressing  safety  and  reliability  makes  up  a  considerable  portion  of  the 
system  functionality,  and  its  robustness  to  hazards  is  critical  to  system  operation.  Assuring  that 
safety  hazards  are  being  addressed  is  critical;  in  particular  we  need  a  better  understanding  of  the 
hazard  contributions  by  the  embedded  software  systems.  We  define  a  safety  hazard  coverage 
measure  with  several  contributors. 

The  first  contributor  reflects  how  complete  the  hazard  specifications  are,  that  is,  how  well  the 
specification  covers  a  known  set  of  hazards.  The  hazard  coverage  count  tracks  for  the  system  and 
its  components  how  many  hazard  specifications  are  documented  for  each  of  the  interaction  points 
(input/output,  control,  resource  usage)  of  the  requirement  specification  and  compares  it  to  the 
known  hazard  count.  A  hazard  specification  indicates  whether  an  error  is  being  propagated  out,  is 
expected  as  an  incoming  hazard,  or  is  expected  to  be  contained  (masked)  by  the  originator  (com¬ 
pleteness  of  specification). 
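
A minimal sketch of the hazard coverage count follows, assuming that the known hazards are grouped by kind of interaction point. The hazard names and the interaction point shown are placeholders, not the taxonomy of Section 3.1.4.

```python
def hazard_coverage(known, documented):
    """For each interaction point, return the fraction of known hazards
    that have a documented hazard specification.

    known: dict mapping interaction kind -> set of known hazard names
    documented: dict mapping interaction point -> (kind, set of specified hazards)"""
    result = {}
    for point, (kind, specified) in documented.items():
        expected = known[kind]
        result[point] = len(specified & expected) / len(expected) if expected else 1.0
    return result


# Placeholder hazard names for one kind of interaction point.
known = {"data_stream": {"omission", "late", "out_of_range", "stale"}}
documented = {"airspeed_in": ("data_stream", {"omission", "late"})}
print(hazard_coverage(known, documented))
```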

For  embedded  software  hazards  we  leverage  fault  containment  mechanisms,  such  as  the  use  of 
dedicated  processors,  runtime-enforced  protected  address  spaces,  and  virtual  machines/partitions. 
Hazard and error propagation between software units in different fault containment units can be
limited  to  their  interactions  in  terms  of  input/output,  control,  and  shared  resource  usage.  Faults, 
whether  they  are  design  or  coding  errors  inherent  in  the  software  component  or  the  result  of  error 
propagation  from  other  software  or  from  hardware,  manifest  themselves  as  interaction  errors.  As 
discussed  in  Section  3.1.4,  we  have  a  known  set  of  potential  hazards  for  data  streams,  control  in¬ 
teraction,  and  resource  usage. 

The  second  contributor  reflects  consistency  between  hazard  specifications  of  interacting  compo¬ 
nents.  The  hazard  consistency  count  tracks  how  many  interactions  in  the  form  of  connections  and 
deployment  bindings  to  platform  resources  have  an  inconsistent  set  of  hazard  specifications  for 
their  endpoints.  The  hazard  specifications  of  two  interaction  endpoints  are  inconsistent  if  the  re¬ 
cipient  expects  the  potential  hazards  to  be  masked  while  the  originator  intends  to  propagate  them. 
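
The inconsistency rule can be stated compactly in code. In this sketch the two enumerations and the tuple layout are assumptions chosen for illustration; only the rule itself (recipient assumes masking while the originator propagates) comes from the text.

```python
from enum import Enum


class Outgoing(Enum):
    PROPAGATED = "propagated"          # originator lets the error out
    MASKED = "masked"                  # originator contains the error


class Incoming(Enum):
    EXPECTED = "expected"              # recipient is prepared for the error
    ASSUMED_MASKED = "assumed_masked"  # recipient relies on the originator masking it


def hazard_consistency_count(interactions):
    """Count interactions whose endpoint hazard specifications disagree.

    interactions: iterable of (name, Outgoing, Incoming) tuples, one per
    hazard on a connection or deployment binding."""
    return sum(
        1
        for _, out, inc in interactions
        if out is Outgoing.PROPAGATED and inc is Incoming.ASSUMED_MASKED
    )


# Hypothetical connections between components.
interactions = [
    ("gps -> nav: stale value", Outgoing.PROPAGATED, Incoming.EXPECTED),
    ("nav -> autopilot: out of range", Outgoing.PROPAGATED, Incoming.ASSUMED_MASKED),
    ("autopilot -> actuator: late command", Outgoing.MASKED, Incoming.ASSUMED_MASKED),
]
print(hazard_consistency_count(interactions))
```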

The  third  contributor  reflects  the  impact  potential  on  high-criticality  components.  This  high- 
criticality  impact  count  tracks  the  number  of  error  propagation  channels  (interaction  points  and 
deployment  mappings)  from  lower  criticality  components  to  higher  criticality  components,  as  well 
as  the  number  of  intended  and  unintended  error  propagations  on  each  channel.  A  higher  count 
represents  a  higher  safety  risk. 
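
One possible way to tally this count is sketched below. The integer encoding of criticality (a larger number means more critical) and the component names are assumptions of the example.

```python
def high_criticality_impact(channels, criticality):
    """Count error propagation channels that run from a lower-criticality
    component to a higher-criticality component.

    channels: list of (source, target, propagation_count) tuples
    criticality: dict mapping component name -> integer level
                 (larger means more critical; assumed encoding)
    Returns (number of such channels, total propagations across them)."""
    upward = [(s, t, n) for s, t, n in channels if criticality[s] < criticality[t]]
    return len(upward), sum(n for _, _, n in upward)


criticality = {"cabin_display": 1, "mission_planner": 2, "flight_control": 4}
channels = [
    ("cabin_display", "flight_control", 2),
    ("flight_control", "cabin_display", 3),   # downward, not counted
    ("mission_planner", "flight_control", 1),
]
print(high_criticality_impact(channels, criticality))
```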

The  fourth  contributor  reflects  the  robustness  of  the  system  and  its  components  to  unintended 
hazard  propagation,  that  is,  propagation  of  errors  due  to  fault  activation  within  a  component  that 
were  not  intended  to  be  propagated  according  to  the  hazard  specification.  For  this  count,  focus  is 
on  the  ability  of  error  propagation  recipients  to  tolerate  propagations  that  they  expected  to  be 
masked. The count tracks the number of outgoing masked hazard specifications for which the
recipient specification indicates expected masking, rather than expected incoming error
propagation.

4.1.3 A System Interaction Coverage Metric


The  quality  of  a  system  implementation  is  strongly  affected  by  architectural  decisions.  In  Section 
2.4,  we  identified  root  cause  areas  for  introducing  system-level  faults  due  to  interactions  among 
embedded  software  system  components,  their  interactions  with  the  physical  system  and  environ¬ 
ment,  and  their  interactions  with  the  underlying  computer  platform.  Therefore,  we  complement 
coverage  metrics  for  software  code,  such  as  Decision  Coverage  (DC)  and  Modified  Condi¬ 
tion/Decision  Coverage  (MC/DC)  found  in  DO-178B  [RTCA  1992],  with  coverage  metrics  that 
focus  on  architecture-level  system  interactions. 

System  interactions  can  introduce  architectural  design  complexity.  The  objective  of  the  system 
interaction  coverage  metric  is  to  capture  several  contributors  to  this  complexity. 

The  first  contributor  tracks  different  peer-to-peer  architecture  interaction  patterns,  such  as  a  pipe¬ 
line,  service  request/response,  or  a  control  feedback  loop  (see  Section  3.2.3),  in  a  system.  Many 
such  interactions  are  between  subsystems  that  maintain  state.  Each  participant  in  the  pattern  has 
its  own  state  machine  with  transitions  and  makes  assumptions  about  the  state  of  the  other  partici¬ 
pants.  Work  by  Miller  has  shown  that  specification  of  interactions  between  two  small  state  ma¬ 
chines  with  transitions  is  prone  to  error  [Miller  2005a].  Therefore,  the  size  of  the  state  machines, 
transition  coverage,  and  reachability  of  states  within  each  subsystem,  as  well  as  their  interactions, 
provide  a  good  measure  of  the  system’s  interaction  complexity. 
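
As a hedged illustration of one ingredient of this contributor, the sketch below computes transition coverage for a single participant's state machine from observed state sequences. The request/response states and the trace are hypothetical.

```python
def transition_coverage(transitions, traces):
    """Fraction of specified transitions exercised by at least one trace.

    transitions: set of (source_state, target_state) pairs from the specification
    traces: list of state sequences observed in test, simulation, or analysis runs"""
    exercised = set()
    for trace in traces:
        # Each adjacent pair of states in a trace is one exercised transition.
        exercised.update(zip(trace, trace[1:]))
    covered = exercised & transitions
    return len(covered) / len(transitions) if transitions else 1.0


spec = {
    ("idle", "request"),
    ("request", "response"),
    ("response", "idle"),
    ("request", "timeout"),
    ("timeout", "idle"),
}
traces = [["idle", "request", "response", "idle"]]
print(transition_coverage(spec, traces))
```

In this example the timeout path is never exercised, which is the kind of gap the coverage measure is meant to expose.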

The  second  contributor  focuses  on  the  operation  modes  of  the  system  and  its  subsystems.  Opera¬ 
tional  modes  represent  operational  states  in  which  a  system  or  system  component  exhibits  a  par¬ 
ticular  behavior.  Larger  embedded  systems  have  operational  modes  at  the  system  level  to  reflect 
mission  operation.  These  are  supported  by  operational  modes  of  various  subsystems,  which  them¬ 
selves  may  make  use  of  subsystems  with  operational  modes.  This  support  requires  a  coordination 
of  operational  mode  states  across  the  architectural  hierarchy.  We  have  a  measure  of  mode  state 
and  transition  coverage  and  consistency  between  the  system’s  and  subsystems’  operational 
modes. 

The  third  contributor  focuses  on  the  fault  management  portion  of  the  system  architecture,  that  is, 
on  redundancy  patterns  and  the  recovery  of  faults,  as  well  as  the  traceability  between  the  identi¬ 
fied  hazards  and  their  expected  mitigation  and  the  fault  management  mechanisms  in  the  actual 
system.  We  have  measures  of  the 

•  complexity  of  the  redundancy  management  logic  for  each  encountered  redundancy  pattern, 
which  may  operate  in  a  distributed  setting 

•  complexity  in  coordinating  fault  recovery  across  different  subsystems,  which  is  similar  to  the 
coordination  of  operational  modes 

•  traceability  coverage  between  safety  and  reliability  hazards  and  their  management  in  the  safe¬ 
ty-critical  portion  of  the  system 

4.2  Qualification-Evidence  Metrics 

Standards  for  safety-critical  systems  such  as  DO-178B  focus  on  specifying  qualification  criteria. 
Criticality  levels  are  identified  for  different  system  components,  and  qualification  criteria  are  as¬ 
signed  to  each — a  larger  set  and  more  stringent  criteria  for  higher  criticality  levels.  The  criticality 
level is determined by (1) the cause of a software component’s anomalous behavior or (2) that
behavior’s contribution to the failure of a system function that results in system hazards and
failure conditions of varying severity. Five severity levels combined with likelihood of occurrence
(expressed  qualitatively  as  likeliness  of  occurrence  or  quantitatively  as  probability  of  occurrence) 
act  as  a  system  safety  management  decision  matrix  [Boydston  2009]. 

The  underlying  assumption  is  that,  by  satisfying  these  criteria,  software  will  have  reached  a  level 
of quality sufficient to be an acceptable risk. These qualification criteria are a combination of
design- and implementation-related criteria, as well as development- and qualification-process-related
criteria. Examples of design- and implementation-related criteria are requirement specification
and design documentation guidance, and coverage ranging from dead code to Modified
Condition/Decision Coverage (MC/DC). Examples of qualification-process-related criteria are
requirements traceability, and correct implementation and application of test suites.

These  criteria  can  be  mapped  into  an  assurance  case  framework,  and  we  will  use  such  a  frame¬ 
work  to  drive  a  qualification-evidence  metric.  This  is  conceptually  illustrated  in  Figure  34.  Claims 
represent  qualification  criteria  on  the  system  and  its  subsystems,  that  is,  requirements  that  must  be 
satisfied  by  the  system  design  and  implementation  as  shown  by  the  evidence.  The  operational 
context  and  the  assumptions  for  each  claim  are  documented.  The  evidence  takes  the  form  of  a 
V&V  activity,  ranging  from  review  and  testing  to  formal  analysis.  The  process  of  producing  the 
evidence  has  its  own  set  of  claims  and  evidence,  such  as  validity  of  the  model  or  test  case  imple¬ 
mentation  and  correct  application  of  the  evidence-generating  method.  The  risk-level  annotations 
reflect  the  criticality  levels  of  different  subsystems  and  different  requirements  on  those  subsys¬ 
tems.  Whether  the  evidence  for  meeting  the  qualification  criteria  is  sufficient  is  reflected  in  the 
argument  and  represents  a  risk  assessment  by  the  qualification  authority. 


Figure  34:  Qualification  Evidence  Through  Assurance  Cases 

We  have  several  contributors  to  the  qualification-evidence  metric.  The  first  contributor  focuses  on 
the  claim  hierarchy.  We  identify  the  significance  of  each  subclaim’s  contribution  to  a  claim  in 
order  to  reflect  the  potential  impact  of  an  unsubstantiated  subclaim,  and  we  weight  it  with  the  crit¬ 
icality  of  the  subsystem  for  which  the  claim  is  made.  We  determine  the  degree  of  claim  coverage 
by subclaims, using a technique similar to the one used for requirements decomposition coverage.
We take into account the context in which the evidence was produced (e.g., the assumptions about
the  operational  environment  made  during  the  tests)  when  assessing  the  risk  of  deploying  the  sys¬ 
tem  in  various  deployment  scenarios. 
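
The first contributor can be pictured as a weighted roll-up over the claim hierarchy. The following sketch is one possible formulation, not a definition from this report: the `Claim` class, the numeric scales for significance and criticality, and the weighted-average rule are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    name: str
    significance: float = 1.0     # weight of this claim within its parent (assumed scale)
    criticality: float = 1.0      # weight of the subsystem the claim is about (assumed scale)
    substantiated: bool = False   # leaf claims only: is there accepted evidence?
    subclaims: list = field(default_factory=list)


def substantiation(claim):
    """Weighted fraction of a claim that is backed by evidence, between 0 and 1.
    Weights are assumed to be positive."""
    if not claim.subclaims:
        return 1.0 if claim.substantiated else 0.0
    weights = [c.significance * c.criticality for c in claim.subclaims]
    scores = [substantiation(c) for c in claim.subclaims]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


top = Claim(
    "system is acceptably safe",
    subclaims=[
        Claim("partitioning holds", significance=2.0, criticality=3.0, substantiated=True),
        Claim("display latency met", significance=1.0, criticality=1.0, substantiated=False),
    ],
)
print(round(substantiation(top), 3))
```

Because the unsubstantiated subclaim carries a low weight here, it lowers the score only slightly; the same gap on a high-criticality subsystem would lower it much more.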

The  second  contributor  identifies  the  degree  to  which  specific  evidence  contributes  to  the  substan¬ 
tiation  of  a  claim  to  reflect  the  impact  that  the  lack  of  particular  evidence  has  on  the  confidence  in 
the  claim.  We  take  into  account  the  assumptions  made  in  the  evidence-producing  process  (i.e.,  the 
fidelity  of  the  model  abstraction  with  respect  to  the  actual  system,  the  consistency  of  the  model 
with  respect  to  the  actual  system  and  other  models,  and  the  proper  execution  of  the  evidence- 
producing  process).  In  this  context  we  may  also  take  into  account  work  on  the  use  of  a  strategy- 
based  risk  model  to  assess  the  impact  of  different  steps  in  the  validation  process  of  Research  and 
Development satellites in terms of expected risk [Langenbrunner 2010].

The  third  contributor  reflects  the  effectiveness  of  the  method  used  to  produce  the  evidence.  A  de¬ 
fect  removal  efficiency  metric  is  intended  to  reflect  the  effectiveness  of  specific  validation  meth¬ 
ods  in  discovering  errors  (i.e.,  to  achieve  fault  prevention).  Boehm,  Miller,  and  Jones  provide 
source  examples  for  this  metric  [Madachy  2010,  Miller  2005b,  Jones  2010].  Rushby  discusses  an 
approach to probabilistically assess the imperfection of software in the context of software
verification as part of system assurance [Rushby 2009]. We can consider incorporating this idea of
probability of imperfection or uncertainty into claims and evidence.

4.3 A Cost-Effectiveness Metric for Reliability Improvement

The  objective  of  this  metric  is  to  determine  cost  savings  from  the  application  of  methods  for  early 
error  detection.  Such  methods  result  in  avoidance  of  certain  rework  and  reduction  of  retest  cycles, 
thus  reducing  error  leakage  and  increasing  our  confidence  in  the  qualification  evidence.  For  this 
metric, we can draw on two pieces of work: the AVSI SAVI return-on-investment (ROI) study
[Ward  2011]  and  the  Constructive  QUALity  MOdel  (COQUALMO)  work  [Madachy  2010].  Both 
these  efforts  draw  on  a  taxonomy  and  on  results  of  a  NASA  study  by  Hayes  [Hayes  2003]. 

The  SAVI  ROI  study  approaches  the  problem  of  estimating  cost  savings  due  to  rework  and  retest 
avoidance  by  comparing  error  introduction  and  removal  percentages  in  the  current  development 
practice (shown in the aggregate in Figure 3 on page 7) against the architecture-centric
model-based  virtual  integration  approach  of  SAVI.  When  we  take  the  normalized  rework  and  re¬ 
test  cost  factors  shown  in  Table  1  on  page  8,  and  apply  them  to  the  error  percentages  introduced  in 
early  phases  (requirements  and  design)  and  detected  in  late  phases  (integration,  system  and  ac¬ 
ceptance  testing),  we  see  that  rework/retest  cost  due  to  requirement  and  design  errors  dominates 
the  total  rework/retest  cost. 

Table 2 shows the cost to remove a defect of a given type relative to the total cost of defect
removal. For example, requirements defects account for 79% of rework cost, and 62% of rework
cost occurs during integration.

Table 2: Relative Defect Removal Cost (as Percent)

                Phase in Which Defect is Removed
Defect Type     Requirements   Design    Code      Test      Integration   Sum
Requirements    0.03%          0.21%     1.87%     28.11%    48.73%        78.97%
Design          -              0.04%     0.37%     5.79%     9.75%         15.96%
Code            -              -         0.19%     1.28%     3.10%         4.57%
Test            -              -         -         0.17%     0.26%         0.43%
Integration     -              -         -         -         0.09%         0.09%
Sum             0.03%          0.26%     2.44%     35.36%    61.92%        100.00%
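
The two shares quoted for Table 2 are its row and column sums. The sketch below recomputes them from the individual cells, treating blank cells as zero (an assumption of this example). The recomputed sums differ from the published Sum entries by at most 0.02 percentage points, presumably because the published cells are rounded.

```python
PHASES = ["Requirements", "Design", "Code", "Test", "Integration"]

# Rows: defect type. Columns: phase in which the defect is removed.
# Values are percent of total defect removal cost, copied from Table 2.
TABLE_2 = {
    "Requirements": [0.03, 0.21, 1.87, 28.11, 48.73],
    "Design":       [0.00, 0.04, 0.37, 5.79, 9.75],
    "Code":         [0.00, 0.00, 0.19, 1.28, 3.10],
    "Test":         [0.00, 0.00, 0.00, 0.17, 0.26],
    "Integration":  [0.00, 0.00, 0.00, 0.00, 0.09],
}


def share_by_defect_type(table):
    """Row sums: share of removal cost attributable to each defect type."""
    return {kind: round(sum(row), 2) for kind, row in table.items()}


def share_by_removal_phase(table):
    """Column sums: share of removal cost incurred in each removal phase."""
    return {
        phase: round(sum(row[i] for row in table.values()), 2)
        for i, phase in enumerate(PHASES)
    }


print(share_by_defect_type(TABLE_2))
print(share_by_removal_phase(TABLE_2))
```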

By  estimating  the  ability  of  a  SAVI  approach  to  discover  errors  in  early  phases,  possibly  in-phase 
with  the  introduction,  we  can  determine  the  percentage  of  rework/retest  cost  that  can  be  avoided. 
Representatives  from  eight  AVSI  SAVI  member  companies  provided  their  estimates  of  possible 
in-phase  detection  for  different  categories  of  requirements  errors  based  on  their  in-house  pilot 
experiences  with  SAVI  technologies.  We  used  the  resulting  figure,  66%,  and  as  a  conservative 
alternate,  33%,  in  the  ROI  calculation.  We  then  calculated  the  rework/retest  avoidance  cost  sav¬ 
ings  according  to  the  following  formula: 

Cost  avoidance  =  Estimated  total  development  cost  *  %  Rework  cost  *  %  Requirements  error 
rework  cost  *  %  Requirements  error  removal  efficiency 

We  calculated  the  estimated  total  development  cost  with  the  Constructive  COst  MOdel 
(COCOMO)  II  [Boehm  2000]  using  software  SLOC  size  estimates  from  commercial  aircraft 
companies,  assuming  a  distributed  integrator/supplier  development  environment  with  commonly 
encountered  code  reuse  percentage.  The  system  development  cost  was  extrapolated  from  the 
software  cost  based  on  industry  figures  that  software  in  aircraft  systems  makes  up  two-thirds  of 
the total system cost. Again based on industry experience, rework/retest cost percentages
(% Rework cost) of 30% and 50% were chosen. The % Requirements error rework cost figure was taken
from Table 2, and for the % Requirements error removal efficiency rate the above-mentioned
estimates were used. The cost savings are, even for the most conservative estimate,
considerably greater than the investment for a single aircraft development. We compared and
confirmed these estimates with member company internal estimates.
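
The cost avoidance formula is a plain product of four factors, which the following sketch evaluates for the four combinations of rework share and removal efficiency mentioned above. The total development cost used here is a placeholder chosen for readability, not a figure from the ROI study.

```python
def cost_avoidance(total_dev_cost, rework_share, req_error_share, removal_efficiency):
    """Cost avoidance = estimated total development cost
    * % rework cost
    * % requirements error rework cost
    * % requirements error removal efficiency."""
    return total_dev_cost * rework_share * req_error_share * removal_efficiency


REQ_ERROR_SHARE = 0.7897            # requirements row sum from Table 2
HYPOTHETICAL_TOTAL_COST = 1_000.0   # $ million; placeholder, NOT a figure from the study

for rework_share in (0.30, 0.50):
    for efficiency in (0.33, 0.66):
        saving = cost_avoidance(
            HYPOTHETICAL_TOTAL_COST, rework_share, REQ_ERROR_SHARE, efficiency
        )
        print(f"rework {rework_share:.0%}, efficiency {efficiency:.0%}: "
              f"${saving:,.1f}M avoided")
```

Because the formula is linear in every factor, doubling the removal efficiency from 33% to 66% doubles the estimated cost avoidance for a given rework share.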

In addition to the cost savings, the ROI study calculated the arithmetic and logarithmic ROI as
well as the Net Present Value (NPV) based on an investment in the SAVI technology infrastructure
of $80 million.
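
For orientation, the sketch below shows the common textbook definitions of arithmetic ROI and NPV. The report does not state the exact formulas used in the study, so these definitions are an assumption, and the savings stream and discount rate are hypothetical; only the $80 million investment comes from the text. The logarithmic ROI is omitted because its definition is not given here.

```python
def arithmetic_roi(savings, investment):
    """Textbook arithmetic ROI: net gain divided by the investment."""
    return (savings - investment) / investment


def net_present_value(investment, yearly_savings, discount_rate):
    """Discount each year's savings to present value and subtract the
    up-front investment. Savings are assumed to arrive at year end."""
    discounted = sum(
        s / (1 + discount_rate) ** (year + 1)
        for year, s in enumerate(yearly_savings)
    )
    return discounted - investment


INVESTMENT = 80.0                        # $ million, from the text
HYPOTHETICAL_SAVINGS = [100.0, 100.0]    # $ million per year; placeholder values
HYPOTHETICAL_RATE = 0.10                 # placeholder discount rate

print(arithmetic_roi(sum(HYPOTHETICAL_SAVINGS), INVESTMENT))
print(round(net_present_value(INVESTMENT, HYPOTHETICAL_SAVINGS, HYPOTHETICAL_RATE), 2))
```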

COQUALMO  was  developed  as  an  extension  to  COCOMO  for  predicting  the  number  of  residual 
defects  in  a  system  and  the  impact  of  a  schedule  change  to  the  cost  and  quality  of  the  system.  It  is 
being  further  extended  to  help  identify  strategies  for  reducing  defects  by  quantifying  the  impact  of 
different  processes  and  technologies  [Madachy  2010].  The  COQUALMO  extension  adds  defect 
introduction  rates  for  requirements,  design,  and  code  defects,  and  defect  removal  rates  for  differ¬ 
ent  removal  methods  (see  Figure  35).  The  defect  categories  for  requirements  are  correctness, 
completeness,  consistency,  and  ambiguity/testability.  For  design/code,  the  categories  are  inter¬ 
face,  timing,  class/object/function,  method/logic/algorithm,  data  values/initialization,  and  check¬ 
ing.  The  removal  methods  (peer  review,  automated  analysis,  and  execution  testing  and  tools)  are 
categorized from very low to extra high (e.g., under automated analysis from simple compiler
syntax checking to formalized specification and verification through model checking and symbolic
execution). The model has been calibrated with industry data and applied in various settings. A
continuous  simulation  model  version  has  been  used  to  evaluate  the  effectiveness  of  different  re¬ 
moval  methods  at  different  times  in  the  life  cycle. 
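
The interplay of introduction and removal rates can be sketched as follows. This is a simplified illustration that assumes the removal activities act independently, each removing a fixed fraction of the defects that remain; the defect count and the fractions are hypothetical and are not calibrated COQUALMO values.

```python
from math import prod


def residual_defects(introduced, removal_fractions):
    """Defects left after applying each removal activity in turn.

    introduced: estimated number of defects introduced for one artifact type
    removal_fractions: fraction of the remaining defects that each activity
                       removes, each between 0 and 1"""
    return introduced * prod(1.0 - f for f in removal_fractions)


# Hypothetical profile: peer review, automated analysis, execution testing.
print(residual_defects(100, [0.5, 0.4, 0.5]))
```

Under this assumption, raising the rating of any one removal method lowers the residual count proportionally, which is what allows the effect of different methods to be compared.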


Figure  35:  COQUALMO  Extension  to  COCOMO 

We  propose  to  investigate  the  adaptation  and  possible  integration  of  the  two  models  above  to  the 
reliability  validation  and  improvement  framework.  In  this  context  we  intend  to  elaborate  the  cate¬ 
gories  of  defect  removal  capability  to  cover  the  full  range  of  evidence-producing  methods  in  an 
assurance-based  qualification  process.  In  particular,  we  can  consider  the  effectiveness  of  inde¬ 
pendent  system  integration  labs  and  virtual  system  integration  labs  in  reducing  defects.  A  virtual 
system  integration  lab  uses  a  SAVI-like  architecture-centric  virtual  integration  approach  to  evalu¬ 
ate  and  validate  system-level  requirements  as  early  as  possible  in  the  development  life  cycle.  We 
also  intend  to  evaluate  the  defect  categories  to  see  how  well  they  capture  the  challenges  discussed 
in this report (e.g., the issue of multiple truths through model inconsistency).

Boehm  discusses  the  use  of  different  risk  minimization  techniques  with  the  COQUALMO,  using 
defect  removal  as  the  primary  risk  reduction  measure  [Madachy  2010].  Our  proposed  supporting 
metrics  refine  such  risk  minimization  by  targeting  specific  system  areas,  taking  advantage  of  ar¬ 
chitectural  knowledge.  Furthermore,  we  take  into  account  both  defect  removal  and  management 
of faults and hazards, an important element of safety-critical systems. We can consider the
possibility of extending the COQUALMO to take these extensions into account.

5 Roadmap Forward


The  aircraft  and  space  industry  in  the  U.S.  and  Europe  has  recognized  the  shortcomings  of  “build 
then  test”  for  their  safety-critical,  software-reliant  systems  and  has  pursued  a  SAVI  solution  using 
AADL  [SAE  2004-2012]  in  an  international  initiative  (AVSI).  The  objective  of  this  initiative  is  to 
mature  and  put  into  place  a  practice  infrastructure  to  support  SAVI  in  terms  of  standards-based 
methods  and  tools  as  discussed  in  Section  3.2.4.  This  maturation  is  accomplished  through  multi¬ 
ple  phases,  each  increasing  the  technical  readiness  level  (TRL)  as  illustrated  in  Figure  36.  The 
first  phase,  shown  as  the  proof-of-concept  loop,  was  completed  in  2009.  It  included  a  proof-of- 
concept  demonstration  (see  Section  3.2.4)  and  an  ROI  study  (see  Section  4.3)  [Ward  2011]. 


Figure 36: Multi-Phase SAVI Maturation Through TRLs

Phase  2  includes  an  extended  POC  demonstration  emphasizing  the 

•  integration  between  system  engineering  and  software  engineering 

•  definition  of  model  repository  and  model  bus  requirements 

•  identification  of  technology  gaps  in  the  SAVI  framework,  and 

•  engagement  of  commercial  tool  vendors. 

Additional phases extend into 2016 (shown in Figure 37) [Redman 2010].

Figure 37: Multi-Year, Multi-Phase SAVI Project Plans

The  objectives  of  the  SAVI  initiative  coincide  with  the  objectives  of  the  reliability  validation  and 
improvement  framework,  with  the  latter  placing  more  focus  on  the  qualification  of  software- 
reliant  systems.  The  Army,  already  an  active  member  of  the  SAVI  initiative,  is  positioned  to  take 
a  leadership  role  in  familiarizing  program  offices  with  this  technology  and  in  extending  and 
adapting  it  to  the  reliability  improvement  and  qualification  focus  of  the  DoD.  We  recommend  a 
set of actions that will advance the practice of engineering safety-critical software-reliant systems
(rotorcraft, aircraft, missile systems, automotive systems, autonomous systems) through a
reliability  improvement  program  focused  on  embedded  software.  We  propose  actions  that  will 
integrate  and  mature  the  four  technology  areas  discussed  in  Section  3  and  facilitate  the  adoption 
of  this  reliability  improvement  and  qualification  practice. 

5.1  Integration  and  Maturation  of  Reliability  Validation  and  Improvement  Technologies 

In  its  Phase  2,  the  SAVI  initiative  has  placed  a  focus  on  reliability  and  safety.  There  is  an  oppor¬ 
tunity  to  pursue  the 

•  integration  of  formalized  requirements  (e.g.,  as  documented  in  the  Requirements  Engineering 
Management  Handbook  [FAA  2009a])  and  best  model-based  safety  analysis  practices  (e.g., 
the  systems  approach  to  safety  by  Leveson  [Leveson  2009]),  as  an  extension  of  the  SAVI 
framework from two perspectives: (1) mapping the requirements and practices into an architectural
model and (2) developing assurance case patterns around safety analysis with an increased
focus on embedded-software-related hazards. This integration requires

  - refining the Error Annex of the SAE AADL standard to better accommodate the semantics
    of software-specific faults and safety-criticality hazards and their mitigation, such as
    those found in FHA, FMEA, and STPA

  - developing a safety-criticality requirement engineering methodology driven by safety
    analysis and focused on the system and embedded software architecture as discussed in
    Section 3.1

  - developing a library of assurance case patterns reflecting the necessary evidence to build
    a strong assurance case. This methodology will be piloted in the context of the SAVI
    initiative to expose it to the aerospace industry and in an Army program to gain experience
    within the Army.

•  expansion  of  system  integration  labs  (SILs)  into  virtual  system  integration  labs  (VSILs)  and 
determining  their  value-added  over  current  testing  practices  in  an  assurance-based  qualifica¬ 
tion  practice.  In  a  VSIL,  the  system  architecture,  its  component  design,  and  their  implementa¬ 
tion  may  be  represented  by  models  that  are  statically  analyzed,  simulated,  or  executed  on  a 
simulated  platform.  This  change  allows  Army  labs  to  establish  an  architecture-centric  model- 
based  independent  validation  and  qualification  practice,  throughout  the  development  life  cy¬ 
cle,  separate  from  DoD  contractors’  adoption  of  an  architecture-centric,  model-based  devel¬ 
opment  practice.  This  expansion  requires 

  - evaluating existing SILs (as well as the proposed VSILs) by identifying problem categories
    they can discover early now and problem categories that currently leak to later phases
    but could be addressed earlier by improved capability of a SIL or VSIL

  - piloting the concept of a VSIL in an Army lab on an actual system to validate the ability
    of VSILs to discover certain problem categories

  - developing an error-leakage-reduction-driven ROI model that predicts cost savings
    achieved in qualification through rework and retest cost avoidance using VSILs and
    value-added SILs. As outlined in Section 4.3, this can be achieved by adapting the ROI
    model developed under SAVI Phase 1 [Ward 2011] and incorporating aspects of the
    COQUALMO [Madachy 2010], as well as calibrating it with Army-specific data.

•  establishment  of  a  cost-effectiveness  metric  for  reliability  validation  and  improvement  as  out¬ 
lined  in  Section  4.  This  involves 

  - studying the cost and effectiveness of different development and validation methods in
    the development and qualification life cycle

  - calibrating the model with Army-specific data

  - applying a resource allocation strategy that maximizes reduction of error leakage and
    minimizes risk, by focusing on high-risk areas and taking into account the criticality of
    the system components

•  development  of  an  end-to-end,  assurance-based  qualification  methodology  and  its  piloting  in 
an  Army  program.  This  involves 

  - expanding assurance case patterns to cover the full development life cycle

  - reflecting, in the argumentation aspect of assurance cases, the risk of not providing
    sufficient or sufficiently qualified evidence. Such an assurance case framework can then
    become the basis for a metric of sufficient evidence that allows qualifiers to quantitatively
    assess the residual risk in qualifying software-reliant systems as outlined in Section 4.2.

• proactive initiation by the Army, in coordination with the other services, of investigations into

  - the impact of new technologies and paradigm shifts (such as the use of multi-core
    technology and the migration to mixed-criticality systems) on the safety criticality of
    systems and existing analysis methods

  - ways of reducing the potential negative reliability impact by putting a new analysis
    framework into place

The SEI continues its work on value-driven incremental development to provide the architectural and
assurance foundations necessary to make incremental development viable in the DoD. The SEI
also continues to be involved in the SAVI initiative to advance an architecture-centric virtual
integration practice. The Army has an opportunity to leverage both the SAVI initiative and the SEI’s
investment in these technology maturation activities.

5.2  Adoption  of  Reliability  Improvement  and  Qualification  Practice 

Adoption  of  this  architecture-centric,  model-based  reliability  validation  and  improvement  practice 
is  a  process  that  can  benefit  from  several  activities  driven  by  the  Army: 

•  The  Army  can  make  changes  to  its  acquisition  language.  The  SEI  has  experience  in  cooperat¬ 
ing  with  the  DoD  in  revising  such  language  to  encourage  the  use  of  these  technologies  (and 
requiring  them  as  appropriate)  while  they  are  maturing. 

•  The  Army  can  benefit  from  gaining  experience  with  the  use  of  the  technologies  to  better  un¬ 
derstand  their  benefits  and  limitations.  This  can  be  achieved  through  a  series  of  well- 
coordinated  pilot  projects  that  will  introduce  the  technologies  incrementally  in  existing  sys¬ 
tems  and  their  upgrades,  as  well  as  new  system  developments.  Pilots  allow  the  Army  to  pre¬ 
sent  a  strong  argument  to  contractors  who  might  resist  the  adoption  of  new  technologies.  Ex¬ 
amples  of  incrementally  introduced  technologies  include  (1)  a  model-based  variant  of  an 
architecture  evaluation  using  the  ATAM  and  (2)  assurance  cases  on  high-risk  aspects  of  a  sys¬ 
tem  or  system  upgrade.  Such  pilots  can  be  patterned  after  the 

  - Common Avionics Architecture System (CAAS) evaluation [Feiler 2004]

  - case study of NASA Jet Propulsion Laboratory’s Mission Data System reference
    architecture [Feiler 2009c]

  - comparative modeling case study of six CAAS-based helicopter systems

  - AADL-model-supported architecture evaluation of the Apache Block 3 upgrade using
    the ATAM

  - application of the Virtual Upgrade Validation (VUV) method [DeNiz 2012] to evaluate
    the migration of Apache to the Block 3 platform and an ARINC653-compliant
    platform

Completion  of  the  pilots  could  make  way  for  development  of  a  handbook  for  safety  engineer¬ 
ing  for  software-reliant  systems,  patterned  after  the  Requirements  Engineering  Handbook 
[FAA  2009a].  Similarly,  experience  with  reliability  metrics  for  the  software-reliant  aspect  of 
systems  would  be  useful  toward  developing  an  addendum  to  the  Software-in-Systems  Relia¬ 
bility  Handbook  [DoD  2010].  Finally,  a  handbook  on  assurance-based  qualification  should  be 
developed. 

•  The Army can ensure that its concurrence with the results of SAVI activities is recognized.
The SAVI initiative will promote model representation and model interchange standards in
support of the SAVI engineering framework, and can extend these efforts to include
assurance-related technology standards. In addition, practice and process standards for
safety-critical systems are currently under revision. For example, DO-178C is in the process of being
finalized and includes sections on object technology, model-based engineering, and use of
formal methods. The SAE S8 committee released a revision of SAE ARP 4754 [SAE 2010]
and is currently revising SAE ARP 4761 [SAE 1996].

6  Conclusion


Rotorcraft and other military and commercial aircraft frequently undergo migration from federated
systems to IMA architectures and experience exponential growth in on-board software size and
complexity. This is due to increasing reliance on complex and highly integrated hardware and
software systems for safe and successful mission operation. The current industrial practice of “build
then test” has resulted in increasing error leakage to system integration test and later phases,
rapidly increasing cost and reducing confidence in purely test-based qualification.

Reliability engineering has its roots in hardware reliability assessment that uses historical data of
slowly evolving system designs. Hardware reliability is a function of time, accounting for the
wear of mechanical parts. In contrast, software reliability is primarily driven by design defects,
resulting in a failure distribution curve that differs from the bathtub curve common for physical
systems. Furthermore, when assessing the reliability of a system, engineers often either assume
software to be perfect and to behave deterministically (that is, to produce the same result given
the same input) or predict fault occurrence based on the size and branching complexity of source
code. Therefore, the focus in software development has been on testing to discover and remove bugs,
using various test coverage metrics to determine test sufficiency. However, embedded software is
time sensitive and implemented as a concurrent set of tasks, leading to nondeterministically
occurring race conditions, unexpected latency jitter, and unanticipated resource contention.
Source-code-based reliability growth metrics are not a good measure of such system-level
interaction complexity.
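
To make the point about race conditions concrete, the following minimal sketch enumerates every interleaving of two tasks that each perform a non-atomic increment (a read step followed by a write step) on a shared counter. It is an illustration only, not taken from the report; the names run and all_schedules are hypothetical. A single sequential run of either task executes every statement, so a statement coverage metric would report the code as fully tested, yet the final value depends on the interleaving.

```python
from itertools import combinations

def run(schedule):
    """Execute one interleaving of two tasks that each increment a shared counter.

    Each task performs a non-atomic increment in two steps: read the counter
    into a local register, then write register + 1 back.
    `schedule` is a sequence of task ids (0 or 1), one entry per step.
    """
    counter = 0
    register = [None, None]   # per-task local copy of the counter
    step = [0, 0]             # next step of each task: 0 = read, 1 = write
    for task in schedule:
        if step[task] == 0:
            register[task] = counter        # read step
        else:
            counter = register[task] + 1    # write step
        step[task] += 1
    return counter

def all_schedules(steps_per_task=2):
    """Yield every interleaving of two tasks with `steps_per_task` steps each."""
    total = 2 * steps_per_task
    for slots in combinations(range(total), steps_per_task):
        yield tuple(0 if i in slots else 1 for i in range(total))

# Group the interleavings by the final counter value they produce.
outcomes = {}
for schedule in all_schedules():
    outcomes.setdefault(run(schedule), []).append(schedule)

for result, schedules in sorted(outcomes.items()):
    print(f"final counter = {result}: {len(schedules)} interleaving(s)")
```

Only the two interleavings in which one task finishes before the other starts yield the intended value of 2; the remaining four lose an update. Exhaustive enumeration is feasible here only because the example is tiny, which is one reason analysis of a model, rather than testing alone, is attractive for such interaction faults.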

The high cost of recertifying software-reliant systems, required for acceptance of software
changes, has resulted in the use of operational work-arounds rather than software fixes to address
software-related problems, due to a less stringent approval process for these work-arounds. As a
result, operators on some systems spend up to 75% of their time performing work-around activities.
In other words, there is a clear need for improvements in recertification.

There is also a need for a reliability improvement program for these software-reliant systems; it
must aim to overcome the limitations of current reliability engineering approaches by integrating
the best emerging technologies that have shown promise in industrial application. Several studies in
the U.S. and Europe have identified four key technologies for addressing these challenges:

1.  specification of system and software requirements in terms of both a mission-critical system
perspective (function, behavior, performance) and safety-critical system perspective
(reliability, safety, security) in the context of a system architecture to allow for completeness and
consistency checking

2.  architecture-centric, model-based engineering, using an architecture model representation
with well-defined semantics, to characterize the system and software architectures in terms
of interactions between the physical system, the computer system, and the embedded software
system. When annotated with analysis-specific information, the model becomes the
primary source for incremental validation with consistency along multiple analysis dimensions
through virtual integration.

3.  use of static analysis in the form of formal methods to complement testing and simulation as
evidence of meeting mission requirements and safety-criticality requirements. Analysis results
can validate the completeness and consistency of system requirements, architectural designs,
detailed designs, and implementation, and ensure that requirements and design constraints
are met early and throughout the life cycle.

4.  use of assurance cases throughout the development life cycle of the system and software to
provide justified confidence in claims supported by evidence that mission and safety-criticality
requirements have been met by the system design and implementation. Assurance
cases systematically manage such evidence (e.g., reviews, static analysis, and testing) and
take into consideration context and assumptions.
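
The structure of an assurance case, as described in the fourth item above, can be sketched as a tree of claims that records context, assumptions, and supporting evidence. The sketch below is a minimal illustration under our own assumptions: the classes Claim and Evidence, the function unsupported, and the example claims are all hypothetical and do not represent any particular assurance case notation or tool.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    kind: str          # e.g. "review", "static analysis", "testing"
    description: str

@dataclass
class Claim:
    statement: str
    context: str = ""                               # context in which the claim is made
    assumptions: list = field(default_factory=list)
    evidence: list = field(default_factory=list)    # evidence directly supporting the claim
    subclaims: list = field(default_factory=list)   # claims from which this one is argued

def unsupported(claim):
    """Return the statements of leaf claims that have no evidence attached."""
    if not claim.subclaims:
        return [] if claim.evidence else [claim.statement]
    gaps = []
    for sub in claim.subclaims:
        gaps.extend(unsupported(sub))
    return gaps

# Hypothetical example case; every claim and evidence item is invented.
top = Claim(
    statement="Stale sensor data never reaches the flight control law",
    context="hypothetical IMA platform, illustrative only",
    assumptions=["sensor timestamps are trustworthy"],
    subclaims=[
        Claim("End-to-end latency stays within its bound",
              evidence=[Evidence("static analysis",
                                 "latency analysis of the architecture model")]),
        Claim("Stale samples are detected and discarded",
              evidence=[Evidence("testing", "fault-injection test campaign")]),
        Claim("The partition scheduler enforces its time windows"),  # no evidence yet
    ],
)

print(unsupported(top))
```

Walking the tree in this way shows where the argument still lacks evidence, which is the kind of systematic evidence management the assurance case is meant to provide during development rather than only at qualification time.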

A number of initiatives in the U.S., Europe, and Japan are integrating and maturing these
technologies into an improved safety-critical software-reliant system engineering practice. In particular,
the SAVI initiative, an international aerospace industry effort, offers an opportunity for leveraged
cooperation as outlined in Section 5, “Roadmap Forward.”

Applied  throughout  the  life  cycle,  reliability  validation  and  improvement  lead  to  an  end-to-end 
V&V  approach.  This  builds  the  argument  and  evidence  for  sufficient  confidence  in  our  system 
throughout  the  life  cycle,  concurrent  with  the  development.  The  framework  keeps  engineering 
efforts  focused  on  high-risk  areas  of  the  system  architecture  and  does  so  in  a  cost-saving  manner 
through  early  discovery  of  system-level  problems  and  resulting  rework  avoidance  [Feiler  2010]. 
From  a  qualification  perspective,  the  assurance  evidence  is  collected  throughout  the  development 
life  cycle  in  the  form  of  formal  analysis  of  the  architecture  and  design,  combined  with  testing  the 
implementation. 

The  architecture-centric  framework  provides  a  basis  for  a  reliability  validation  and  improvement 
program  of  software-reliant  systems  [Goodenough  2010].  Building  software-reliant  systems 
through  an  architecture-centric,  model-based  analysis  of  requirements  and  designs  allows  for  the 
discovery  of  system-level  errors  early  in  the  life  cycle. 

The framework also provides the basis for a set of metrics that can drive cost-effective reliability
validation and improvement. These metrics address shortcomings in statistical fault density and
reliability growth metrics when applied to software. They are architecture-centric metrics that
focus on the major sources of system-level faults: namely, requirements, system hazards, and
architectural system interactions. They are complemented by a qualification-evidence metric that is
based on assurance case structures, leverages the DO-178B model of qualification criteria of
different stringency for different criticality levels, and takes into account the effectiveness of
different evidence-producing validation methods.

The effects of acting on this early discovery are reduced error leakage rates to later development
phases (e.g., as predicted by the COQUALMO residual defect model [Madachy 2010]) and major
system cost savings through rework and retest avoidance (e.g., as demonstrated by the SAVI ROI
study [Ward 2011]). We can leverage these cost models to guide the cost-effective application of
appropriate validation methods.
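
The effect of adding an early validation activity on error leakage can be illustrated with a small calculation in the spirit of a residual defect model. This is a sketch under stated assumptions, not the COQUALMO model itself: the multiplicative form (each activity independently removes a fixed fraction of the defects that reach it), the function residual_defects, and every number below are hypothetical.

```python
from math import prod

def residual_defects(introduced, removal_fractions):
    """Defects remaining after each removal activity is applied in turn.

    Assumes each activity independently removes a fixed fraction of the
    defects that reach it (a simplifying assumption for illustration).
    """
    return introduced * prod(1.0 - f for f in removal_fractions)

introduced = 1000.0  # hypothetical count of requirements and design defects

# Hypothetical removal fractions per validation activity.
test_only = {"peer review": 0.30, "testing": 0.50}
with_early_analysis = {"architecture model analysis": 0.40, **test_only}

baseline = residual_defects(introduced, test_only.values())
improved = residual_defects(introduced, with_early_analysis.values())

print(f"residual defects, build then test:     {baseline:.0f}")
print(f"residual defects, with early analysis: {improved:.0f}")
print(f"defects no longer leaking downstream:  {baseline - improved:.0f}")
```

With calibrated removal fractions and per-phase rework costs in place of these invented values, the same kind of calculation could be used to compare validation methods by how much leakage they avoid per unit of effort.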

Appendix  Selected Readings


For  additional  reading  on  the  topics  presented  in  this  report,  see  the  publications  below. 

•  [Leveson 2004b] The Role of Software in Spacecraft Accidents. This paper discusses a number
of software-related factors that have contributed to spacecraft accidents.

•  [Dvorak 2009] NASA Study on Flight Software Complexity. This paper reports the results of
a study of issues related to the increasing complexity in on-board software.

•  [Boehm 2006] Some Future Trends and Implications for System and Software Engineering
Processes. This paper discusses several trends for improving the engineering of
software-intensive systems.

•  [Feiler 2009b] Challenges in Validating Safety-Critical Embedded Systems. This paper outlines
system-level problem areas in safety-critical embedded software systems and identifies
four root cause areas that can be addressed through architectural analysis.

•  [Goodenough 2010] Evaluating Software’s Impact on System and System of Systems Reliability,
SEI March 2010. A paper summarizing the state of reliability engineering for
software-reliant systems and the need for software-specific reliability improvement programs.

•  [Jackson 2007] Software for Dependable Systems: Sufficient Evidence? This National Research
Council study identifies assurance through evidence in the form of formal analysis of
system architectures as key to improving embedded software in dependable systems.

•  [Feiler 2009c] Model-Based Software Quality Assurance with the Architecture Analysis &
Design Language. A case study on use of AADL to analyze a multi-layered reference architecture,
including a planning and plan execution component and its instantiation for a specific
system in the autonomous space vehicle domain.

•  [Feiler 2010] System Architecture Virtual Integration: A Case Study. A summary of an aerospace
industry (AVSI) case study involving multi-tier modeling of an aircraft and multi-dimensional
analysis at different levels of fidelity in the context of a development process that
involves a system integrator and multiple suppliers.

•  [Feiler 2012] Model-Based Engineering with AADL: An Introduction to the SAE Architecture
Analysis & Design Language. This book provides an introduction to the use of AADL in
architecture-centric model-based engineering.

•  [Leveson 2009] Engineering a Safer World, System Safety for the 21st Century. This book
reflects Leveson’s latest insights on safety engineering.

•  [Goodenough 2009] Evaluating Hazard Mitigations with Dependability Cases. This paper
demonstrates the use of assurance cases to validate safety hazard mitigation.

•  [Miller 2010] Software Model Checking Takes Off. This paper summarizes the state of model
checking in industrial applications.

•  [Bozzano 2010] Formal Verification & Validation of AADL Models. This work illustrates the
use of the Error Annex extension to AADL in modeling and validating safety-critical systems
from both a system and software perspective.

Acronym     Definition

AADL        Architectural Analysis and Description Language
AED         Aviation Engineering Directorate
AF          Air Force
AFB         Air Force Base
AFIS        Association Française d’Ingénierie Système
AHS         American Helicopter Society
AIAA        American Institute of Aeronautics and Astronautics
AMRDEC      Aviation and Missile Research, Development and Engineering Center
AMSAA       US Army Material Systems Analysis Activity
ANSI        American National Standards Institute
AR          Army Regulation
ARP         Aeronautical Recommended Practice
AS          Aeronautical Standard
ASIIST      Application Specific I/O Integration Support Tool for Real-Time Bus Architecture Designs
ASN         Abstract Syntax Notation
ASSERT      Automated proof-based System and Software Engineering for Real-Time applications
ASTREE      Analyseur statique de logiciels temps-réel embarqués (real-time embedded software static analyzer)
ATAM        Architecture Tradeoff Analysis Method®
AVM         Adaptive Vehicle Make
AVSI        Aerospace Vehicle Systems Institute
BAE         British Aerospace & Engineering
CA          California
CAAS        Common Avionics Architecture System
CAST        Commercial Aviation Safety Team
CAV         Computer Aided Verification
CBMC        C Bounded Model Checker
CCA         Common Cause Analysis
CCM         CORBA Component Model
CD          Compact Disc
CEGAR       Counter Example-Guided Abstraction Refinement
CERT        "CERT" and "CERT Coordination Center" are registered service marks of Carnegie Mellon University. CERT is not an acronym.
CMMI        Capability Maturity Model Integration
CMU         Carnegie Mellon University
CO          Colorado
COCOMO      constructive COst MOdel
COMPASS     Correctness, Modeling and Performance of Aerospace Systems
COQUALMO    constructive QUALity MOdel
CORBA       Common Object Request Broker Architecture
COTS        Commercial Off the Shelf
CTL         Computation Tree Logic
DACS        Data and Analysis Center for Software
DARPA       Defense Advanced Research Projects Agency
DC          Decision Coverage
DMA         Direct Memory Access
DNA         Deoxyribonucleic acid
DO          Document
DOORS       Dynamic Object Oriented Requirements System
DOT         Department of Transportation
DRE         Distributed Real-time Embedded
DSP         Digital Signal Processing
DTIC        Defense Technology Information Center
EECS        Electrical Engineering Computer Sciences
ERTS        Embedded Real Time Systems
ESA         European Space Agency
ESOP        European Symposium on Programming
FAA         Federal Aviation Administration
FAUST       Formal Analysis of Goal-Oriented Requirements Using Specification Tools
FDA         Food and Drug Administration
FDIR        Fault Detection, Isolation and Recovery
FL          Florida
FMCAD       Formal Methods in Computer-Aided Design International Conference
FMEA        Failure Modes Effects Analysis
FPGA        Field Programmable Gate Array
FTA         Fault Tree Analysis
GAO         Government Accounting Office
GE          General Electric
GORE        Goal-oriented Requirements Engineering
GPP         General Purpose Processor
ICFEM       International Conference on Formal Engineering Methods
ICSE        International Conference on Software Engineering
IEC         International Engineering Consortium
IEEE        Institute of Electrical and Electronics Engineers
IFIP        International Federation of Information Processing
IMA         Integrated Modular Avionics
INCOSE      International Council on Systems Engineering
IRST        Istituto per la Ricerca Scientifica e Tecnologica
ISBN        International Standard Book Number
ISO         International Standards Organization
ISSRE       International Symposium on Software Reliability Engineering
ITC         International Test Conference
LTL         Linear Temporal Logic
MA          Massachusetts
MARTE       Modeling and Analysis of Real-Time and Embedded Systems
MBSE        Model Based Systems Engineering
MC          Modified Condition
MIL         Military
MILS        Multiple Independent Levels of Security
MISRA       Motor Industry Software Reliability Association
MIT         Massachusetts Institute of Technology
MN          Minnesota
MPEG        Models and Processes for the Evaluation of COTS Components
MRMC        Markov Reward Model Checker
MTBF        Mean Time Between Failure
MTTF        Mean Time To Fix
NASA        National Aeronautics and Space Administration
NCSC        National Computer Security Center
NDIA        National Defense Industrial Association
NG          Newsgroup
NIST        National Institute of Science and Technology
NPV         Net Present Value
NRC         National Research Council
NSN         National Stock Number
NTIS        National Technical Information Service
NUREG       US Nuclear Regulatory Commission
OMB         Office of Management & Budget
OMG         Object Management Group
OSATE       Open Source AADL Tool Environment
PA          Pennsylvania
POC         Proof of Concept
POSIX       Portable Operating System Interface
PRISM       probabilistic model checker
PSSA        Preliminary System Safety Assessment
QEST        Quantitative Evaluation of Systems
RAT         Requirements Analysis Tool
RFP         Request for Proposal
RMA         Rate Monotonic Analysis
ROI         Return on Investment
ROM         Read Only Memory
RSML        Requirements State Machine Language
RSP         Rapid System Prototyping
RTCA        Radio Technical Commission for Aeronautics
RTE         Run Time Errors
RTS         Reliable Software Technologies
RTSS        Real-time System Symposium
SA          System Assurance or Situational Awareness
SAE         Society of Automotive Engineers
SAIC        Science Applications International Corporation
SAVI        Systems Architecture Virtual Integration
SC          South Carolina
SCADE       Safety-Critical Application Development Environment
SCR         Software Cost Reduction
SEI         Software Engineering Institute
SIGSOFT     Special Interest Group on Software Engineering
SIL         System Integration Lab
SLOC        Software Lines of Code
SMT         Satisfiability Modulo Theory
SMV         Symbolic Model Verification
SPICES      Support for Predictable Integration of Mission-Critical Embedded Systems
SPIN        Simple Promela Interpreter
SSA         System Safety Assessment
STAMP       Systems Theory Accident Model and Processes
STD         Standard
STPA        STAMP to Prevent Accidents
SVM         System Verification Manager
SW          Software
TACAS       Tools and Algorithms for the Construction and Analysis of Systems
TASTE       The ASSERT Set of Tools for Engineering
TINA        Time petri Net Analyzer
TOPCASED    Toolkit in OPen-source for Critical Applications and SystEms Development
TOPLAS      Transactions on Programming Languages and Systems
TRL         Technical Readiness Level
UK          United Kingdom
UML         Unified Modeling Language
UNIX        Uniplexed Information and Computing System (was UNICS)
UPPAAL      Uppsala Universitet and Aalborg University Language
VA          Virginia
VERSA       Verification Execution and Rewrite System for ACSR (Algebra of Communicating Shared Resources)
VHDL        VHSIC Hardware Description Language
VHSIC       Very High Speed Integrated Circuit
VSIL        Virtual System Integration Laboratory
VUV         Virtual Upgrade Validation
WG          Working Group
XMI         XML Message Interface
XML         Extensible Markup Language

cause areas and proposes a framework for reliability validation and improvement that integrates several recommended technology
solutions: validation of formalized requirements; an architecture-centric, model-based engineering approach that uncovers system-level
problems early through analysis; use of static analysis for validating system behavior and other system properties; and managed
confidence in qualification through system assurance. This framework also provides the basis for a set of metrics for cost-effective reliability
improvement that overcome the challenges of existing software complexity, reliability, and cost metrics.